Table of Contents

This notebook presents the results of the Data Analysis Part of a Master Thesis conducted by Hugo Moreau (hugo.moreau@epfl.ch) within UPC Cablecom - Data management team. The project aimed at building a proof of concept regarding the use of Machine Learning in order to predict customer premises equipment failure.

It aims at providing a very thorough understanding of the different choices that had to be taken during this phase and will therefore contain some code repetition for the sake of explaining correctly our line of thought.

In [75]:
# Some important imports
%matplotlib inline
%load_ext autoreload
%autoreload 2

import numpy as np
import pandas as pd
pd.options.mode.chained_assignment = None
import matplotlib.pyplot as plt
import seaborn as sns
import itertools
from mpl_toolkits.mplot3d import Axes3D
sns.set_context('notebook')

# Sklearn imports
import sklearn
from sklearn import calibration
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

# Some imports that may require package installation
try:
    import missingno as msno
except ModuleNotFoundError:
    print('You need to run: pip install missingno')

# Own Scripts import
from scripts.energy_test_DP import *
from scripts.utils import *
from scripts.preprocessing import *
from scripts.plot import *
from scripts.model_selection import *

# get rid of warning due to deprecated modules in sklearn
import warnings
warnings.simplefilter('ignore')

# Constants
DATA_FOLDER = './Data'
The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload

Intermediary Analysis

Influence of weekends

Before completed Data Collection, we decided that it would be interesting to look at whether we could find any influence of the fact that Day_0 is a weekend day or not on the vector of measurements. This would potentially allow us to decide to exclude such Day_0 from our following analysis.

Because data collection was not finished at such an early stage, we have decided to conduct the analysis on a subsample of the nodes. To to this we arbitrarily selected two service groups over a week ('ch-mbcALT301','ch-mbcARA301')

In [7]:
DATA_PATH = DATA_FOLDER+'/initial_non_aggregated.xlsx'

df = excel_to_df(DATA_PATH,DATA_FOLDER+'/extracted.pk')
df.head()
Retrieving from ./Data/extracted.pk
Out[7]:
MAC ACCOUNT_NUMBER CMTS SERVICE_GROUP FIRST_DAY LAST_DAY DAY_TYPE OFFLINE_PCT_6H OFFLINE_PCT_12H OFFLINE_PCT_18H OFFLINE_PCT_24H TX_UP_6H TX_UP_12H TX_UP_18H MISS_TX_UP_6H MISS_TX_UP_12H MISS_TX_UP_18H MISS_TX_UP_24H RX_UP_6H RX_UP_12H RX_UP_18H MISS_RX_UP_6H MISS_RX_UP_12H MISS_RX_UP_18H MISS_RX_UP_24H RX_DN_6H RX_DN_12H RX_DN_18H MISS_RX_DN_6H MISS_RX_DN_12H MISS_RX_DN_18H MISS_RX_DN_24H CER_DN_6H CER_DN_12H CER_DN_18H MISS_CER_DN_6H MISS_CER_DN_12H MISS_CER_DN_18H MISS_CER_DN_24H CER_UP_6H CER_UP_12H CER_UP_18H MISS_CER_UP_12H MISS_CER_UP_6H MISS_CER_UP_18H MISS_CER_UP_24H SNR_DN_6H SNR_DN_12H SNR_DN_18H MISS_SNR_DN_6H MISS_SNR_DN_12H MISS_SNR_DN_18H MISS_SNR_DN_24H SNR_UP_6H SNR_UP_12H SNR_UP_18H MISS_SNR_UP_6H MISS_SNR_UP_12H MISS_SNR_UP_18H MISS_SNR_UP_24H PCT_TRAFFIC_DMH_UP_6H PCT_TRAFFIC_DMH_UP_12H PCT_TRAFFIC_DMH_UP_18H MISS_PCT_TRAFFIC_DMH_UP_6H MISS_PCT_TRAFFIC_DMH_UP_12H MISS_PCT_TRAFFIC_DMH_UP_18H MISS_PCT_TRAFFIC_DMH_UP_24H PCT_TRAFFIC_SDMH_UP_6H PCT_TRAFFIC_SDMH_UP_12H PCT_TRAFFIC_SDMH_UP_18H MISS_PCT_TRAFFIC_SDMH_UP_6H MISS_PCT_TRAFFIC_SDMH_UP_12H MISS_PCT_TRAFFIC_SDMH_UP_18H MISS_PCT_TRAFFIC_SDMH_UP_24H
0 0CEEE6E1CD2C 5302149 ch-mbcALT301 RW01 11/04/2018 00:00:00 17/04/2018 00:00:00 WEEKDAY 0.0 0.0 0.0 0.0 -0.000775 -0.002660 -0.006353 0.000000 0.000000 0.000000 0.0 -0.005562 0.024849 -0.045483 0.0 0.0 0.000000 0.0 -0.017834 0.007838 0.006268 0.000000 0.000000 0.000000 0.0 0.002646 -0.000041 -0.005208 0.0 0.000000 0.000000 0.0 0.011372 -0.011020 -0.002651 0.0 0.0 0.000000 0.0 -0.019355 0.017161 0.009354 0.000000 0.000000 0.000000 0.0 0.016189 -0.027356 -0.035336 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 16.666667 16.666667 0.000000 0.000000 0.0 0.0 0.0 16.666667 16.666667 0.000000 0.000000
1 0CEEE6E1CD2C 5302149 ch-mbcALT301 RW01 11/04/2018 00:00:00 17/04/2018 00:00:00 WEEKDAY 0.0 0.0 0.0 0.0 -0.003552 -0.000431 -0.006589 0.000000 16.666667 16.666667 0.0 -0.025523 0.013196 -0.091040 0.0 0.0 16.666667 0.0 -0.000324 0.005300 -0.000680 0.000000 16.666667 16.666667 0.0 0.000000 0.000000 0.000000 0.0 16.666667 16.666667 0.0 -0.010319 -0.000910 -0.003129 0.0 0.0 16.666667 0.0 -0.008812 0.011431 0.023665 0.000000 16.666667 16.666667 0.0 -0.030405 0.011112 0.047698 0.0 0.0 16.666667 0.0 0.0 0.0 0.0 16.666667 0.000000 16.666667 0.000000 0.0 0.0 0.0 16.666667 0.000000 16.666667 0.000000
2 0CEEE6E1CD2C 5302149 ch-mbcALT301 RW01 11/04/2018 00:00:00 17/04/2018 00:00:00 WEEKEND 0.0 0.0 0.0 0.0 -0.000269 -0.018263 0.006472 16.666667 0.000000 0.000000 0.0 -0.012237 -0.060808 0.044667 0.0 0.0 0.000000 0.0 -0.011564 0.006603 0.007208 16.666667 0.000000 0.000000 0.0 -0.005250 0.000000 0.000000 0.0 0.000000 0.000000 0.0 0.003938 -0.006197 0.004825 0.0 0.0 0.000000 0.0 -0.021407 0.017842 0.011744 16.666667 0.000000 0.000000 0.0 0.001648 0.020785 -0.032002 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 16.666667 0.000000 16.666667 0.0 0.0 0.0 0.000000 16.666667 0.000000 16.666667
3 0CEEE6E1CD2C 5302149 ch-mbcALT301 RW01 11/04/2018 00:00:00 17/04/2018 00:00:00 WEEKEND 0.0 0.0 0.0 0.0 -0.000286 -0.003274 0.003641 0.000000 0.000000 0.000000 0.0 0.012216 0.043443 0.027098 0.0 0.0 0.000000 0.0 -0.006672 0.011216 0.003506 0.000000 0.000000 0.000000 0.0 0.000000 0.002564 -0.002564 0.0 0.000000 0.000000 0.0 0.004117 -0.018467 0.018970 0.0 0.0 0.000000 0.0 0.004476 0.009823 0.001266 0.000000 0.000000 0.000000 0.0 -0.050323 0.009915 -0.005099 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.000000 16.666667 0.000000 16.666667 0.0 0.0 0.0 0.000000 16.666667 0.000000 16.666667
4 0CEEE6E1CD2C 5302149 ch-mbcALT301 RW01 11/04/2018 00:00:00 17/04/2018 00:00:00 WEEKDAY 0.0 0.0 0.0 0.0 0.007693 -0.001378 -0.002988 0.000000 0.000000 0.000000 0.0 0.027318 -0.005571 -0.011758 0.0 0.0 0.000000 0.0 0.001290 0.000979 -0.004526 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 0.0 0.000000 0.000000 0.0 -0.008707 0.000730 -0.004060 0.0 0.0 0.000000 0.0 0.021555 -0.006862 -0.002567 0.000000 0.000000 0.000000 0.0 -0.038321 -0.008070 0.038797 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 16.666667 16.666667 0.000000 0.000000 0.0 0.0 0.0 16.666667 16.666667 0.000000 0.000000

There exist three types of information in the dataset:

  • static_infos: information concerning the CPE vector itself, the MAC address, account_number, CMTS and Service group, as well as information regarding the dates over which the vector was computed.
  • measurements_non_miss: the real indicators that are polled using SAA either from the CPE itself or from the CMTS.
  • measurements_miss: because we aggregate at the database level we wanted to keep the information regarding the missing measurement as they could be useful to the classifier at a later stage. We have therefore for each aggregate from measurements_non_miss added the proportion of values from the aggregates that were missing.
In [8]:
cols = list(df.columns)
static_infos = ['MAC','ACCOUNT_NUMBER','CMTS','SERVICE_GROUP','FIRST_DAY','LAST_DAY','DAY_TYPE']
measurements = [x for x in cols if x not in static_infos]
measurements_non_miss = [x for x in measurements if not('MISS' in x)]
print('We have collected {:d} measurements'.format(len(measurements)))
We have collected 67 measurements

Naive Approach: Kolmogorov-Smirnov statistic on 2 samples.

As an initial naive approach we considered a KS-test on each individual measurement. This test is not so meaningful as it doesn't consider our vector as a multi-variate random variable but takes each dimension separately. Nevertheless as an initial approach it could help us explore each measurement.

meaning of this test :

"If the K-S statistic is small or the p-value is high, then we cannot reject the hypothesis that the distributions of the two samples are the same."

In [10]:
weekday_df = df[df.DAY_TYPE == 'WEEKDAY']
weekend_df = df[df.DAY_TYPE == 'WEEKEND']
values_weekday = weekday_df['OFFLINE_PCT_6H'].values
values_weekend = weekend_df['OFFLINE_PCT_6H'].values
In [11]:
stats.ks_2samp(values_weekday, values_weekend)
Out[11]:
Ks_2sampResult(statistic=0.0031149509986742041, pvalue=0.99927681127409662)

On this example were we try to decide whether the distribution of OFFLINE_PCT_6H during the weekend and the week is different, we see that the p-value is almost maximal and we cannot reject the null hypothesis at any reasonable confidence level.

We will compute such statistic for all the different variable and explore the ones with the smallest p-value to see if the human eye can distinguish the two distributions.

In [21]:
test_results = get_ks_test_result(weekday_df,weekend_df,measurements)
test_results.head(10)
Out[21]:
statistic pvalue
measurement
MISS_PCT_TRAFFIC_DMH_UP_18H 0.397146 0.000000e+00
MISS_SNR_UP_18H 0.394048 0.000000e+00
MISS_SNR_DN_18H 0.372491 0.000000e+00
MISS_CER_UP_18H 0.364987 0.000000e+00
MISS_PCT_TRAFFIC_SDMH_UP_18H 0.397146 0.000000e+00
MISS_TX_UP_18H 0.372337 0.000000e+00
MISS_CER_DN_18H 0.389702 0.000000e+00
MISS_RX_DN_18H 0.372409 0.000000e+00
MISS_RX_UP_18H 0.394048 0.000000e+00
MISS_RX_DN_12H 0.144773 2.531803e-253
In [29]:
sorted_mes = list(test_results.index)

Mow that we can see that some p-value are fairly low (for exemple for MISS_SNR_UP_18H we can reject the null hypothesis with significance level less than $0.01\%$. We are sure that the distributions are different at $99.99\%$. Therefore we wanted to observe graphically how this translates in term of distribution.

In [50]:
plot_difference(sorted_mes[0],weekday_df,weekend_df)
In [51]:
plot_difference(sorted_mes[1],weekday_df,weekend_df)
In [52]:
plot_difference(sorted_mes[2],weekday_df,weekend_df)

Unfortunately, due to the number of days that are observed for week ends and for week days we cannot be sure that this result can actually be trusted as it is possible that it simply comes from the fact that we have unbalanced classes. But also we see that here because the MISS_* measurements are simply the ratio over 6h windows of missing measurements, it isn't a continuous variable it is hard to conclude on the ks_test (Also, from the documentation we can see that the test assumes continuous variables and therefore has non value here).

By observing the distributions we see that we have a pattern, for the 3 cases, for weekdays most of the samples are 0 while about $40\%$ of them are $16.7$ for weekends. which translates in having just one missing value over a 6h window which isn't necessarily a noticeable difference.

Therefore we decided to look at measurement that are real measurement and not just missing ratios.

In [41]:
test_results_no_miss = get_ks_test_result(weekday_df,weekend_df,measurements,with_miss_mes=False)
sorted_mes_no_miss = list(test_results_no_miss.index)
test_results_no_miss.head(3)
Out[41]:
statistic pvalue
measurement
CER_UP_12H 0.128066 2.547019e-198
CER_UP_6H 0.094043 3.869613e-107
CER_DN_18H 0.092828 2.120185e-104
In [55]:
plot_difference(sorted_mes_no_miss[0],weekday_df,weekend_df)
In [56]:
plot_difference(sorted_mes_no_miss[1],weekday_df,weekend_df)
In [58]:
plot_difference(sorted_mes_no_miss[2],weekday_df,weekend_df,n_bins=10000)

All these plots do not display a significant difference between the two dataset and especially the very last one. This is why we will try to look at the cumulative distribution

In [63]:
f, axes = plt.subplots(1, 2, figsize=(18, 5), sharex=True , sharey =True)
axes[0].set_xlim(-0.005,0.010)
axes[1].set_xlim(-0.005,0.010)
weekday_df['CER_DN_18H'].dropna().hist(cumulative=True, normed=1, bins=10000,histtype='step',ax=axes[0])
weekend_df['CER_DN_18H'].dropna().hist(cumulative=True, normed=1, bins=10000,histtype='step',color='red',ax=axes[1])
f.show()

But again the difference isn't so clear here and we cannot necessarily exclude the fact that these come from different distribution as this difference could arise from the lack of data.

Multivariate Two Sample tests

The problem with the previous test is that it considers only each dimension independently but we are interested in determining if the day_type influences our vector. Some links of interest :

  1. Modern two sample tests: lists a couple of methods that might become useful to run such test
  2. A Multivariate Two-Sample Test Based on the Concept of Minimum Energy
  3. Testing For Equal Distributions In High Dimension: Bootstrap and Energy distance
  4. A Test for Equality of Distributions in High Dimensions: Bootstrap and nearest neighbor distribution
  5. Testing for the Equality of two Distributions on High Dimensional Object Spaces

The third link seems to be a good approach for our use case. Nevertheless it will ask for some numpy tricks as it can be quite computationally expensive if it isn't done properly, therefore we use some Dynamic programming (in order to not recompute the euclidean distance between each vector multiple times) and numpy fancy indexing in order to use already implemented optimization.

All the functions used to perform what we will refer as the energy test are implemented in the energy_test_DP.py file.

Testing on dummy samples

Now we test briefly that the method does indeed detect different and equal sampling distributions

In [65]:
DISTANCE_MATRIX_FOLDER = 'Distance_matrices/'

# Here it shouldn't reject the null hypothesis as the two are sampled from the same distribution
path = DISTANCE_MATRIX_FOLDER + 'same_distrib'
x = np.stack([np.random.exponential(size=200),np.random.normal(size=200)])
y = np.stack([np.random.exponential(size=200),np.random.normal(size=200)])
E_distance_replicates = energy_two_sample_test(x,y,499,1,distance_matrix_bkp=path)
Computing the distance matrix...
Saving the distance matrix at: Distance_matrices/same_distrib.npy
[############################################################] 100.0% (499/499) 
We cannot reject the Null hypothesis (CL = 99%): p-value = 0.7575150300601202
	 observed = 0.9972094296243794 	 limit = 4.955482478884141
Execution time: 2.7933s
In [66]:
# While here it should reject for obvious reasons
path = DISTANCE_MATRIX_FOLDER + 'different_distrib'
x = np.stack([np.random.exponential(size=200),np.random.normal(size=200)])
y = np.stack([np.random.normal(size=200),np.random.normal(size=200)])
E_distance_replicates = energy_two_sample_test(x,y,499,1,distance_matrix_bkp=path)
Computing the distance matrix...
Saving the distance matrix at: Distance_matrices/different_distrib.npy
[############################################################] 100.0% (499/499) 
We reject the Null hypothesis (CL = 99%): p-value = 0.0
	 observed = 35.347305119278126 	 limit = 4.3015997447574525
Execution time: 3.03s

We can see that naive testing is conclusive with is encouraging regarding the successful implementation of the test discussed in the paper.

Running on day_type averages to have a contained dataset size

The size of the distance matrix on the full dataset makes it very long to run and hard to store which is why we first will compute two datasets of equal size (we average a given CPE vector over weekdays and weekend days)

In [67]:
weekday_df_avg = weekday_df.groupby(by=['MAC','ACCOUNT_NUMBER','CMTS','SERVICE_GROUP','FIRST_DAY','LAST_DAY']).mean()
weekend_df_avg = weekend_df.groupby(by=['MAC','ACCOUNT_NUMBER','CMTS','SERVICE_GROUP','FIRST_DAY','LAST_DAY']).mean()
In [68]:
path = DISTANCE_MATRIX_FOLDER + 'day_type_avg'
x = weekday_df_avg[measurements].dropna().values.T
y = weekend_df_avg[measurements].dropna().values.T
E_distance_replicates = energy_two_sample_test(x,y,99,1,distance_matrix_bkp=path)
#used to run in 504s when computing the distance matrix
Computing the distance matrix...
Saving the distance matrix at: Distance_matrices/day_type_avg.npy
[############################################################] 100.0% (99/99) 
We reject the Null hypothesis (CL = 99%): p-value = 0.0
	 observed = 87340.1873985218 	 limit = 71.00247756253034
Execution time: 563.9178s

The test rejects the null hypothesis and we can see that the bootstrapped energy estimates are on a much lower scale than the observed one supporting even more this result

In [69]:
E_distance_replicates.sort()
min_ = E_distance_replicates[0]
max_ = E_distance_replicates[-1]
print('Replicates range in: [{},{}]'.format(round(min_,3),round(max_,3)))
Replicates range in: [14.399,74.027]

We could challenge this result given the very high dimension of the vectors that could cause the euclidean dimension to overshoot.

Running on the full dataset with some tweaking

Running on the full dataset was not possible so rather than taking all the samples that are available to us we will sample from the full to obtain random samples from the weekday and weekend to perform the test efficiently.

The dataset has:

  • $44'556$ samples from the week
  • $17'740$ samples from the weekend

Because we sample we cannot store the distance matrix and need to recompute it each time.

In [70]:
x = weekday_df[measurements].dropna().values.T
y = weekend_df[measurements].dropna().values.T

# and we sample it without replacement
sample_size = 10000
x_samp = x[:,np.random.choice(x.shape[1],sample_size,replace=False)]
y_samp = y[:,np.random.choice(y.shape[1],sample_size,replace=False)]
In [71]:
E_distance_replicates = energy_two_sample_test(x_samp,y_samp,99,1,use_bkp=False)
Computing the distance matrix...
[############################################################] 100.0% (99/99) 
We reject the Null hypothesis (CL = 99%): p-value = 0.0
	 observed = 57526.90875528135 	 limit = 86.97295650292305
Execution time: 670.0426s

As a control we would like to run the same test but on the population that we suppose are from the same distribution.

In [72]:
week_sample = x[:,np.random.choice(x.shape[1],2*sample_size,replace=False)]
x_samp = week_sample[:,:sample_size]
y_samp = week_sample[:,sample_size:]
E_distance_replicates = energy_two_sample_test(x_samp,y_samp,99,1,use_bkp=False)
Computing the distance matrix...
[############################################################] 100.0% (99/99) 
We cannot reject the Null hypothesis (CL = 99%): p-value = 0.9393939393939394
	 observed = 21.14666469516635 	 limit = 182.54820695272315
Execution time: 552.9907s
In [73]:
sample_size = 5000
weekend_sample = y[:,np.random.choice(y.shape[1],2*sample_size,replace=False)]
x_samp = weekend_sample[:,:sample_size]
y_samp = weekend_sample[:,sample_size:]
E_distance_replicates = energy_two_sample_test(x_samp,y_samp,99,1,use_bkp=False)
Computing the distance matrix...
[############################################################] 100.0% (99/99) 
We cannot reject the Null hypothesis (CL = 99%): p-value = 0.7373737373737373
	 observed = 18.988653392337795 	 limit = 79.40241056401551
Execution time: 122.438s

We will now try to use a more meaningful test where we perform multiple test on the full dataset using multiple samples to obtain a more meaningful p-value averaging away the potential variations due to the dataset.

In [74]:
x = weekday_df[measurements].dropna().values.T
y = weekend_df[measurements].dropna().values.T
energy_two_sample_large_dataset(x,y)
[############################################################] 100.0% (990/990) 
We reject the Null hypothesis (CL = 99%): p-value has mean=0.0 and std=0.0 
Execution time: 5449.1725s
Out[74]:
([0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
 [57529.147254417287,
  58224.723553516335,
  59467.644129170832,
  58082.010742646606,
  59806.767724199031,
  58435.523788791637,
  60487.205204299455,
  58960.50947513846,
  58821.438710254006,
  57900.385110268078],
 [90.246665240738992,
  117.75661864641958,
  106.28354529298873,
  119.85151157191807,
  107.05732500772022,
  80.986189359624092,
  78.784117957921708,
  101.82868699681862,
  108.12004880726239,
  90.246348414944947])

We still reject the Null hypothesis. And we keep our control experiments where we run it on two samples from the same population and observe that the Null Hypothesis cannot be rejected.

In [75]:
energy_two_sample_large_dataset(x,None,sample_size=5000,similar=True)
[############################################################] 100.0% (990/990) 
We cannot reject the Null hypothesis (CL = 99%): p-value has mean=0.692929292929293 and std=0.18648335920636924 
Execution time: 1446.9748s
Out[75]:
([0.8888888888888888,
  0.2828282828282828,
  0.6060606060606061,
  0.9797979797979798,
  0.5656565656565656,
  0.7171717171717171,
  0.6565656565656566,
  0.7878787878787878,
  0.8282828282828283,
  0.6161616161616161],
 [25.509857345387132,
  49.056015886357329,
  36.194393423407689,
  20.683393574483233,
  37.822918119374549,
  31.322451072810509,
  33.674663080542899,
  30.86953615015986,
  25.953798364941605,
  33.854134473152442],
 [117.41975646357993,
  134.17087404820248,
  183.21851584668914,
  103.37075160068419,
  108.56052935772504,
  107.92265556343337,
  158.4726821922082,
  132.37129679444587,
  104.49963611562129,
  156.08000297610224])
In [76]:
energy_two_sample_large_dataset(y,None,sample_size=5000,similar=True)
[############################################################] 100.0% (990/990) 
We cannot reject the Null hypothesis (CL = 99%): p-value has mean=0.4363636363636364 and std=0.3334802247509879 
Execution time: 1320.4287s
Out[76]:
([0.9292929292929293,
  0.43434343434343436,
  0.8686868686868687,
  0.3939393939393939,
  0.29292929292929293,
  0.1111111111111111,
  0.1111111111111111,
  0.2828282828282828,
  0.9292929292929293,
  0.010101010101010102],
 [13.663423334486779,
  28.175375279415604,
  16.208112084452608,
  29.369360299540404,
  34.945395008056934,
  48.187165233626317,
  41.106523613700929,
  40.2510529053135,
  14.753401745082328,
  81.606493951156267],
 [81.108365654060506,
  79.892474159494697,
  88.375261200968993,
  61.322882720597747,
  71.590887435745074,
  61.788304900497458,
  83.615815255926407,
  80.3765982337585,
  76.120310086298517,
  80.442215281406064])

Conclusion of the test

The test seems to indicate that the time of the week of Day_0 does indeed influence the vectors of aggregate on the last 24h (measurements regarding Day_0). We will nevertheless consider all days as equivalent to be selected as Day_0 and will at a later stage decide whether to keep them or not. Once we will be classifying, such choice will show results that are more easily interpretable.

Because we will be able to have meaningful insights regarding how getting rid of those can improve the scores.

Exploratory Data Analysis

Thanks to a framework provided by Business Intelligence, we didn't have to perform the usual plotting of all the 291 features as a report (that is linked to the thesis) can be automatically generated

In [8]:
# We set some pandas options to improve readability when a dataframe is pretty printed.
pd.set_option('max_columns', 290)
pd.set_option('max_colwidth', 5000)
pd.set_option('display.date_dayfirst', True)

Because the project is done while data keeps being collected, we develloped some tools that allow us to import easily the data that we extract from our Backup table.

Importing data

As it can be observed in the signature of import_sample, the helper function will read the excel file containing the sample, convert each feature to the correct formats, convert the hardware models to indices, keep the vectors from dates where there exist at least one sick cpe, but most importantly it will serialize the result to allow for faster future imports

In [9]:
DATA_PATH = DATA_FOLDER + '/sample_27_04.xlsx'

# We use the help function in order to obtain the dataframe in a correct format.
df = import_sample(DATA_PATH,DATA_FOLDER+'/sample_27_04.pk')
df.head()
Reading ./Data/sample_27_04.xlsx
Performing some transformation
Saving to ./Data/sample_27_04_sick_only.pk
The sample is composed of : 7106 vectors of dimension 291
	n_sick		=   1164
	n_healthy	=   5942
Out[9]:
weekday day_0 mac cly_account_number saa_account_number cmts service_group hardware_model n_cpe_building offline_pct_6h offline_pct_12h offline_pct_18h offline_pct_24h offline_pct offline_pct_1d offline_pct_2d offline_pct_3d offline_pct_4d offline_pct_5d cer_dn_6h cer_dn_12h cer_dn_18h cer_dn cer_dn_1d cer_dn_2d cer_dn_3d cer_dn_4d cer_dn_5d miss_cer_dn_6h miss_cer_dn_12h miss_cer_dn_18h miss_cer_dn_24h miss_cer_dn miss_cer_dn_1d miss_cer_dn_2d miss_cer_dn_3d miss_cer_dn_4d miss_cer_dn_5d cer_up_6h cer_up_12h cer_up_18h cer_up cer_up_1d cer_up_2d cer_up_3d cer_up_4d cer_up_5d miss_cer_up_6h miss_cer_up_12h miss_cer_up_18h miss_cer_up_24h miss_cer_up miss_cer_up_1d miss_cer_up_2d miss_cer_up_3d miss_cer_up_4d miss_cer_up_5d pct_traffic_dmh_up_6h pct_traffic_dmh_up_12h pct_traffic_dmh_up_18h pct_traffic_dmh_up pct_traffic_dmh_up_1d pct_traffic_dmh_up_2d pct_traffic_dmh_up_3d pct_traffic_dmh_up_4d pct_traffic_dmh_up_5d miss_pct_traffic_dmh_up_6h miss_pct_traffic_dmh_up_12h miss_pct_traffic_dmh_up_18h miss_pct_traffic_dmh_up_24h miss_pct_traffic_dmh_up miss_pct_traffic_dmh_up_1d miss_pct_traffic_dmh_up_2d miss_pct_traffic_dmh_up_3d miss_pct_traffic_dmh_up_4d miss_pct_traffic_dmh_up_5d pct_traffic_sdmh_up_6h pct_traffic_sdmh_up_12h pct_traffic_sdmh_up_18h pct_traffic_sdmh_up pct_traffic_sdmh_up_1d pct_traffic_sdmh_up_2d pct_traffic_sdmh_up_3d pct_traffic_sdmh_up_4d pct_traffic_sdmh_up_5d miss_pct_traffic_sdmh_up_6h miss_pct_traffic_sdmh_up_12h miss_pct_traffic_sdmh_up_18h miss_pct_traffic_sdmh_up_24h miss_pct_traffic_sdmh_up miss_pct_traffic_sdmh_up_1d miss_pct_traffic_sdmh_up_2d miss_pct_traffic_sdmh_up_3d miss_pct_traffic_sdmh_up_4d miss_pct_traffic_sdmh_up_5d rx_dn_6h rx_dn_12h rx_dn_18h rx_dn rx_dn_1d rx_dn_2d rx_dn_3d rx_dn_4d rx_dn_5d miss_rx_dn_6h miss_rx_dn_12h miss_rx_dn_18h miss_rx_dn_24h miss_rx_dn miss_rx_dn_1d miss_rx_dn_2d miss_rx_dn_3d miss_rx_dn_4d miss_rx_dn_5d rx_up_6h rx_up_12h rx_up_18h rx_up rx_up_1d rx_up_2d rx_up_3d rx_up_4d rx_up_5d miss_rx_up_6h miss_rx_up_12h miss_rx_up_18h miss_rx_up_24h miss_rx_up miss_rx_up_1d miss_rx_up_2d miss_rx_up_3d miss_rx_up_4d miss_rx_up_5d snr_dn_6h snr_dn_12h snr_dn_18h snr_dn snr_dn_1d snr_dn_2d snr_dn_3d snr_dn_4d snr_dn_5d miss_snr_dn_6h miss_snr_dn_12h miss_snr_dn_18h ... miss_snr_dn miss_snr_dn_1d miss_snr_dn_2d miss_snr_dn_3d miss_snr_dn_4d miss_snr_dn_5d snr_up_6h snr_up_12h snr_up_18h snr_up snr_up_1d snr_up_2d snr_up_3d snr_up_4d snr_up_5d miss_snr_up_6h miss_snr_up_12h miss_snr_up_18h miss_snr_up_24h miss_snr_up miss_snr_up_1d miss_snr_up_2d miss_snr_up_3d miss_snr_up_4d miss_snr_up_5d tx_up_6h tx_up_12h tx_up_18h tx_up tx_up_1d tx_up_2d tx_up_3d tx_up_4d tx_up_5d miss_tx_up_6h miss_tx_up_12h miss_tx_up_18h miss_tx_up_24h miss_tx_up miss_tx_up_1d miss_tx_up_2d miss_tx_up_3d miss_tx_up_4d miss_tx_up_5d cmts_rx_up_6h cmts_rx_up_12h cmts_rx_up_18h cmts_rx_up cmts_rx_up_1d cmts_rx_up_2d cmts_rx_up_3d cmts_rx_up_4d cmts_rx_up_5d cmts_tx_up_6h cmts_tx_up_12h cmts_tx_up_18h cmts_tx_up cmts_tx_up_1d cmts_tx_up_2d cmts_tx_up_3d cmts_tx_up_4d cmts_tx_up_5d cmts_cer_up_6h cmts_cer_up_12h cmts_cer_up_18h cmts_cer_up cmts_cer_up_1d cmts_cer_up_2d cmts_cer_up_3d cmts_cer_up_4d cmts_cer_up_5d cmts_utilization_up_6h cmts_utilization_up_12h cmts_utilization_up_18h cmts_utilization_up cmts_utilization_up_1d cmts_utilization_up_2d cmts_utilization_up_3d cmts_utilization_up_4d cmts_utilization_up_5d cmts_ms_utilization_up_6h cmts_ms_utilization_up_12h cmts_ms_utilization_up_18h cmts_ms_utilization_up cmts_ms_utilization_up_1d cmts_ms_utilization_up_2d cmts_ms_utilization_up_3d cmts_ms_utilization_up_4d cmts_ms_utilization_up_5d cmts_f_ms_utilization_up_6h cmts_f_ms_utilization_up_12h cmts_f_ms_utilization_up_18h cmts_f_ms_utilization_up cmts_f_ms_utilization_up_1d cmts_f_ms_utilization_up_2d cmts_f_ms_utilization_up_3d cmts_f_ms_utilization_up_4d cmts_f_ms_utilization_up_5d cmts_rx_dn_6h cmts_rx_dn_12h cmts_rx_dn_18h cmts_rx_dn cmts_rx_dn_1d cmts_rx_dn_2d cmts_rx_dn_3d cmts_rx_dn_4d cmts_rx_dn_5d cmts_snr_dn_6h cmts_snr_dn_12h cmts_snr_dn_18h cmts_snr_dn cmts_snr_dn_1d cmts_snr_dn_2d cmts_snr_dn_3d cmts_snr_dn_4d cmts_snr_dn_5d cmts_ccer_dn_6h cmts_ccer_dn_12h cmts_ccer_dn_18h cmts_ccer_dn cmts_ccer_dn_1d cmts_ccer_dn_2d cmts_ccer_dn_3d cmts_ccer_dn_4d cmts_ccer_dn_5d cmts_cer_dn_6h cmts_cer_dn_12h cmts_cer_dn_18h cmts_cer_dn cmts_cer_dn_1d cmts_cer_dn_2d cmts_cer_dn_3d cmts_cer_dn_4d cmts_cer_dn_5d cmts_utilization_dn_6h cmts_utilization_dn_12h cmts_utilization_dn_18h cmts_utilization_dn cmts_utilization_dn_1d cmts_utilization_dn_2d cmts_utilization_dn_3d cmts_utilization_dn_4d cmts_utilization_dn_5d seq_id milestone_name
0 4 2018-04-06 0CEEE6E2C2AC 5639672 5639672 ch-mbcBRN301 RW26 1 4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.001031 0.000415 -0.001446 0.000000 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 4.166667 4.166667 4.166667 0.0 0.000105 -0.000420 0.001789 -0.007967 0.000785 -0.001184 -0.001190 0.004950 -0.002379 0.0 0.0 0.0 0.0 0.0 0.0 4.166667 4.166667 4.166667 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 16.666667 0.000000 0.0 16.666667 8.333333 8.333333 8.333333 12.500000 8.333333 8.333333 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 16.666667 0.000000 0.0 16.666667 8.333333 8.333333 8.333333 12.500000 8.333333 8.333333 -0.005345 -0.002388 -0.000456 0.061741 -0.001555 0.001497 0.001822 -0.002989 -0.001464 0.0 0.0 0.000000 0.0 0.000000 0.000000 4.166667 4.166667 4.166667 0.0 0.003899 -0.005362 0.000274 0.026256 0.001061 -0.000601 -0.002986 -0.004179 0.003344 0.0 0.0 0.0 0.0 0.0 0.0 4.166667 4.166667 4.166667 0.0 -0.006647 0.001534 -0.027039 0.273161 0.054835 0.018959 -0.040272 -0.014389 -0.010505 0.0 0.0 0.000000 ... 0.000000 0.000000 4.166667 4.166667 4.166667 0.0 0.033163 -0.029787 -0.003332 0.029677 -0.017459 -0.004426 0.015933 -0.014446 0.000158 0.0 0.0 0.0 0.0 0.0 0.0 4.166667 4.166667 4.166667 0.0 -0.000321 -0.000081 0.000346 0.010331 -0.000431 -0.000068 -0.000055 0.001647 -0.000006 0.0 0.0 0.000000 0.0 0.000000 0.000000 4.166667 4.166667 4.166667 0.0 0.000000 -0.016667 -0.004167 5.884375 -0.012500 -0.003125 0.026042 -0.017708 0.004167 0.0500 -0.020833 -0.016667 44.935417 -0.038542 0.010417 -0.001042 -0.078125 0.005208 -0.001250 -0.002083 0.003750 0.001875 -0.001875 0.002500 -0.000208 0.000521 -0.002292 3.788333 1.967500 -3.007917 4.454479 -3.493646 2.175312 2.395417 -1.281146 -0.640104 5.062735 2.748307 -3.984669 6.522815 -4.008310 2.440581 2.790888 -1.562189 -0.501645 -0.244566 -0.225514 0.099700 99.479373 0.073695 -0.017645 -0.020231 0.047314 0.046982 -0.017708 -0.094792 0.057292 2.047396 0.019271 -0.004687 0.005729 -0.045312 -0.017969 -0.011458 -0.055208 0.020833 38.611198 0.002344 0.010937 -0.007812 -0.025260 -0.008073 0.001875 0.001354 -0.004583 0.008229 -0.006562 0.001042 0.008281 0.000573 0.002422 -6.000000e-42 -0.000208 -0.004583 0.003229 -0.005234 0.004271 0.001224 -0.000052 0.002708 3.980104 8.752187 -5.997604 19.181042 -0.097292 -2.119714 2.052917 -0.731016 -0.575755 9668 NaN
1 4 2018-04-06 0CEEE6E2E0A4 542870 542870 ch-mbcREB301 RW17 1 4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000285 -0.000285 0.0 0.0 0.000000 0.0 0.000000 0.000000 4.166667 4.166667 4.166667 0.0 -0.002670 -0.005488 0.002871 -0.012676 0.001675 -0.001926 0.000629 -0.006549 0.004942 0.0 0.0 0.0 0.0 0.0 0.0 4.166667 4.166667 4.166667 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 16.666667 0.000000 0.0 16.666667 8.333333 8.333333 8.333333 12.500000 8.333333 8.333333 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 16.666667 0.000000 0.0 16.666667 8.333333 8.333333 8.333333 12.500000 8.333333 8.333333 0.002894 -0.003055 0.001101 -0.144205 0.003849 0.005553 -0.002262 -0.008175 -0.002979 0.0 0.0 0.000000 0.0 0.000000 0.000000 4.166667 8.333333 8.333333 0.0 -0.009200 0.021629 -0.005211 0.050076 0.020736 -0.005638 0.005619 0.001284 0.000533 0.0 0.0 0.0 0.0 0.0 0.0 4.166667 4.166667 4.166667 0.0 0.021969 -0.018959 0.029619 0.172375 -0.014162 0.024151 -0.008988 -0.018325 -0.005334 0.0 0.0 0.000000 ... 0.000000 0.000000 4.166667 8.333333 8.333333 0.0 0.002664 0.017235 0.002934 0.089762 0.004968 0.048268 -0.014768 -0.014723 -0.003701 0.0 0.0 0.0 0.0 0.0 0.0 4.166667 4.166667 4.166667 0.0 -0.001803 -0.008496 0.004057 -0.031393 0.000538 0.001881 0.008057 -0.000337 -0.003932 0.0 0.0 0.000000 0.0 0.000000 0.000000 4.166667 8.333333 8.333333 0.0 0.137500 -0.266667 0.033333 5.901042 -0.015625 0.041667 0.031250 -0.048958 -0.037500 0.0375 0.154167 -0.062500 44.839583 0.015625 -0.047917 -0.045833 0.082292 0.064583 0.006250 0.002083 -0.002500 0.003646 -0.000104 -0.002500 0.000208 -0.000521 0.001146 4.173750 -0.140000 -3.222500 6.006563 -1.780938 1.416042 0.399792 -1.953333 1.108854 5.364180 0.370027 -4.159503 8.309745 -2.201554 1.712914 0.656630 -2.529155 1.270114 -0.391832 -0.084261 0.030546 99.382566 0.022961 -0.029124 -0.050264 0.130901 -0.051303 0.079167 -0.768750 0.173958 1.697135 -0.105729 0.224219 -0.043750 -0.380208 -0.178385 0.026042 -0.141667 0.023958 38.624479 -0.014583 0.017969 -0.002865 -0.039844 -0.029948 0.002188 -0.008229 -0.002083 0.022682 -0.007474 -0.004010 0.004036 -0.005156 0.001615 8.333333e-04 0.002500 -0.000313 0.008828 0.000417 0.001094 -0.000208 0.003646 0.002422 23.499583 9.772917 -2.301563 32.523828 -2.533880 -4.551589 7.615703 -9.699922 -0.953333 492291 NaN
2 4 2018-04-06 0024D1D12417 5829582 5829582 mbcNCH301 RW36 6 18 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.010985 -0.010985 0.000000 0.0 0.0 16.666667 0.0 4.166667 4.166667 4.166667 8.333333 8.333333 0.0 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.001488 0.000055 0.0 0.0 0.0 0.0 0.0 0.0 4.166667 4.166667 4.166667 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 16.666667 0.000000 0.0 0.000000 4.166667 4.166667 8.333333 8.333333 4.166667 4.166667 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 16.666667 0.000000 0.0 0.000000 4.166667 4.166667 8.333333 8.333333 4.166667 4.166667 -0.002633 0.024696 -0.017377 -0.043930 -0.000763 0.018804 0.006905 0.013588 0.000866 0.0 0.0 16.666667 0.0 4.166667 4.166667 4.166667 8.333333 8.333333 0.0 -0.067935 -0.018669 0.002603 0.089187 -0.025663 0.005616 0.021912 0.011321 -0.020633 0.0 0.0 0.0 0.0 0.0 0.0 4.166667 4.166667 4.166667 0.0 0.000759 0.047562 0.002865 0.287527 0.010987 -0.001424 0.003435 -0.021793 0.004609 0.0 0.0 16.666667 ... 4.166667 4.166667 4.166667 8.333333 8.333333 0.0 -0.169661 0.549638 -0.252859 -0.195336 0.108450 -0.215384 0.134303 -0.291086 0.078745 0.0 0.0 0.0 0.0 0.0 0.0 4.166667 4.166667 4.166667 0.0 0.001051 -0.015901 -0.007290 -0.088703 0.010498 -0.010610 -0.010596 -0.019229 -0.006426 0.0 0.0 16.666667 0.0 4.166667 4.166667 4.166667 8.333333 8.333333 0.0 -0.033333 -0.020833 -0.008333 5.783333 -0.015625 0.008333 0.005208 -0.025000 -0.029167 0.0375 0.090000 0.060000 48.399375 -0.066250 0.080208 0.053125 0.062500 0.026458 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -0.000104 0.271250 -0.061667 -4.940000 3.831562 0.022083 -0.056875 1.616875 -19.854479 14.022813 0.343252 -0.208725 -5.613827 4.817970 0.065533 -0.188166 2.176558 -21.414273 15.279563 0.009974 -0.032722 0.053995 99.845836 -0.029919 0.017721 -0.040697 0.047354 0.005583 -0.026042 -0.339375 0.060208 -0.492500 -0.120365 -0.161979 0.122344 -0.061146 0.003333 0.006250 -0.102917 0.012292 37.902031 -0.007083 -0.027448 0.017135 -0.019375 0.012448 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 -0.000052 0.000052 0.000000 0.000000e+00 0.000000 0.000000 0.000000 0.000000 0.000000 -0.000260 0.000260 0.000000 3.114688 -2.334771 0.255812 15.212833 -6.653469 -23.596635 26.616219 3.024948 -0.840432 81 NaN
3 4 2018-04-06 002624C5CC64 1885940 1885940 ch-mbcZOE301 RW102 6 27 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 -0.001569 0.000000 0.000000 -0.000392 -0.000392 0.000195 0.001771 -0.001966 0.000000 0.0 0.0 0.000000 0.0 0.000000 0.000000 4.166667 4.166667 4.166667 0.0 -0.002963 0.001084 -0.000312 -0.006184 0.001231 -0.000526 -0.000689 -0.000116 -0.000245 0.0 0.0 0.0 0.0 0.0 0.0 4.166667 4.166667 4.166667 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 16.666667 16.666667 0.0 16.666667 12.500000 12.500000 8.333333 8.333333 8.333333 12.500000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 16.666667 16.666667 0.0 16.666667 12.500000 12.500000 8.333333 8.333333 8.333333 12.500000 0.001464 0.005854 -0.008281 0.169967 -0.001865 -0.000569 -0.000145 0.003509 0.001736 0.0 0.0 0.000000 0.0 0.000000 0.000000 4.166667 4.166667 4.166667 0.0 -0.005095 0.001666 0.005599 0.023898 -0.003981 -0.004791 -0.001222 0.004806 0.006124 0.0 0.0 0.0 0.0 0.0 0.0 4.166667 4.166667 4.166667 0.0 0.004130 0.000852 -0.004990 0.252526 0.003216 0.003217 -0.003391 -0.003555 0.002196 0.0 0.0 0.000000 ... 0.000000 0.000000 4.166667 4.166667 4.166667 0.0 -0.020819 0.017170 -0.005697 0.037806 -0.029859 -0.010181 0.023891 0.012977 -0.049460 0.0 0.0 0.0 0.0 0.0 0.0 4.166667 4.166667 4.166667 0.0 -0.001041 0.000686 0.000362 -0.373918 -0.000365 -0.000199 -0.001732 -0.000860 0.004171 0.0 0.0 0.000000 0.0 0.000000 0.000000 4.166667 4.166667 4.166667 0.0 -0.020833 0.020833 -0.016667 5.963542 0.005208 0.007292 -0.007292 -0.007292 -0.001042 -0.0125 0.004167 -0.008333 45.134375 -0.004167 -0.002083 0.017708 -0.029167 -0.036458 0.000000 -0.000417 0.000417 0.000104 -0.000313 0.000313 0.000104 0.000000 -0.000104 3.797917 1.949167 0.114167 6.253854 -1.154688 -6.997292 -8.048646 -2.418438 16.429271 5.164868 2.641652 0.076048 8.784105 -1.884735 -7.109656 -10.508968 -2.344706 18.938263 -0.505715 -0.259857 0.067081 99.260027 0.055146 0.060291 0.016195 0.225488 -0.299779 0.153125 0.072917 -0.287500 -1.486719 0.075521 -0.129427 -0.049740 0.098958 0.062500 0.032292 0.004167 -0.057292 37.267187 0.009635 -0.011979 -0.007292 0.008333 0.008854 -0.045625 0.050000 0.002604 0.025755 0.009896 -0.022812 0.015182 -0.017422 -0.014609 -1.239583e-02 0.015938 -0.002500 0.006745 0.002005 -0.005104 -0.000260 0.006797 -0.000599 14.912917 2.899167 -3.569688 16.300547 -9.502370 7.744635 0.469167 -2.777917 -0.180990 360 NaN
4 4 2018-04-06 002624C80E50 5913213 5913213 ch-mbcZFO301 RW26 6 11 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000017 -0.002407 -0.002433 -0.003024 0.001543 0.000249 -0.002163 -0.000527 -0.000885 0.0 0.0 0.000000 0.0 0.000000 0.000000 4.166667 4.166667 4.166667 0.0 -0.002198 -0.000218 0.001829 -0.010340 -0.000495 0.000347 0.001285 -0.002424 -0.000458 0.0 0.0 0.0 0.0 0.0 0.0 4.166667 4.166667 4.166667 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 8.333333 8.333333 4.166667 4.166667 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.0 0.000000 0.000000 0.000000 8.333333 8.333333 4.166667 4.166667 -0.026469 -0.002170 0.047426 -0.158618 0.009155 -0.001517 -0.000608 -0.003414 0.000587 0.0 0.0 0.000000 0.0 0.000000 0.000000 4.166667 4.166667 4.166667 0.0 0.034511 -0.021355 -0.036921 0.038414 -0.012551 0.003655 0.030213 -0.004390 0.010026 0.0 0.0 0.0 0.0 0.0 0.0 4.166667 4.166667 4.166667 0.0 0.019134 -0.019392 -0.036425 0.316751 -0.020945 0.010921 0.009698 -0.040294 0.016402 0.0 0.0 0.000000 ... 0.000000 0.000000 4.166667 4.166667 4.166667 0.0 -0.004442 -0.002673 -0.028502 0.027842 -0.017704 0.007031 0.013498 0.002361 -0.022136 0.0 0.0 0.0 0.0 0.0 0.0 4.166667 4.166667 4.166667 0.0 0.008753 0.000226 -0.003442 -0.252612 -0.009583 -0.002170 0.010877 -0.008924 0.001620 0.0 0.0 0.000000 0.0 0.000000 0.000000 4.166667 4.166667 4.166667 0.0 -0.004167 -0.112500 0.054167 5.929167 0.001042 0.006250 0.037500 -0.037500 -0.020833 0.0375 0.041667 0.012500 39.689583 0.001042 0.023958 -0.002083 0.037500 -0.003125 0.005833 -0.005417 0.005833 0.007292 0.000104 -0.005521 0.006042 -0.002292 -0.029375 0.329167 2.116667 -4.310417 5.829479 -1.783229 0.322396 -1.199792 1.946146 0.922396 0.377154 2.756506 -4.318874 7.526392 -2.500859 0.495668 -1.341749 2.243635 1.067588 -0.179028 -0.149524 0.184933 99.577481 0.003323 0.036885 -0.041462 0.067223 -0.037268 -0.032292 -0.383333 0.031250 6.688281 -0.065625 0.037760 0.082552 -0.284635 -0.061979 0.034375 0.263542 -0.054167 38.574740 -0.000521 0.013802 -0.049740 0.098958 0.009375 -0.007396 0.001458 0.008021 0.022083 -0.002370 0.003307 -0.002526 0.006458 -0.002995 -4.166667e-04 0.012812 -0.005000 0.020573 -0.007865 0.002865 0.012578 0.001849 0.004609 9.718854 5.665208 -8.354583 15.336380 0.912396 -0.583828 1.347005 -1.268281 0.761719 669 NaN

5 rows × 291 columns

In [66]:
# We also identify the columns that are going to be used in later stages.
non_feature_cols = ['mac','day_0',
                    'cly_account_number',
                    'saa_account_number',
                    'cmts','service_group',
                    'seq_id','milestone_name']
feature_cols = [x for x in list(df.columns) if x not in non_feature_cols]

Null values

In [11]:
x = df[feature_cols]
null_counts = 100*x.isnull().sum()/x.count()
print("Number of null values in each column:\n{}".format(null_counts.sort_values(ascending=False)[:6]))
Number of null values in each column:
cer_up_6h     13.932981
cer_up_12h    11.501648
cer_up_18h    10.858034
rx_dn_6h       8.637823
tx_up_6h       8.588020
snr_dn_6h      8.588020
dtype: float64

We want to look at missing values. The module missingno allows us to visualize the missing values as white lines on a black matrix. Nevertheless it is complicated to use such package with such a dataset as we have a lot of features making it hard to see anything.

In [15]:
msno.matrix(x)
Out[15]:
<matplotlib.axes._subplots.AxesSubplot at 0x1a2184ff60>

In this matrix we can see in white the missing values, and we can see that somehow some features are NULL together which may be an interesting pattern. Nevertheless, considering the number of columns in our DataFrame it is fairly hard to study such pattern

For large datasets, it might be hard to investigate on trends for missing values, therefore we use a dendrogram, the missingno package is said to:

allow[...] you to more fully correlate variable completion, revealing trends deeper than the pairwise ones visible in the correlation heat-map.

and also regarding its interpretation:

To interpret this graph, read it from a top-down perspective. Cluster leaves which linked together at a distance of zero fully predict one another's presence—one variable might always be empty when another is filled, or they might always both be filled or both empty, and so on.

In [16]:
fig = msno.dendrogram(x)

What is reassuring regarding this dendogram is that we can see that most of the leaves group together measurements that are aggregated over the time period. (e.g. all aggregates over $day_{-4}$ are grouped together: *_4d ). This shows that as expected when a router is offline most of the corresponding features on that particular day are missing (we say most because some measurements are also polled at the cmts level and concern the cmts not only the CPE).

Feature selection

We would also like to look at the features that are highly correlated, given the very high dimensionality of the dataset this could potentially allow us to reduce the dimension of the vectors if we find features that are sufficiently correlated.

To perform such analysis we can use the most recent sample that is available.

In [17]:
s_day = '11_05'
DATA_PATH = DATA_FOLDER + '/sample_'+s_day+'.xlsx'
BACKUP_PATH = DATA_FOLDER+'/sample_'+s_day+'.pk'

extracted = import_sample(DATA_PATH,BACKUP_PATH)
dates = extracted['day_0'].values

# Keep only the features that are used for prediction (delete the MAC, Service Group etc.)
x_extracted, y_extracted = get_ml_data(extracted)

# Then we perform some initial transformation on the data that will be explained at a later time
x_encoded = encode_categorical(x_extracted)
x_no_missing = impute_missing(x_encoded,method='mean')
Reading ./Data/sample_11_05.xlsx
Performing some transformation
Saving to ./Data/sample_11_05_sick_only.pk
The sample is composed of : 10162 vectors of dimension 291
	n_sick		=   2584
	n_healthy	=   7578

An initial analysis highlighted that some columns were exactly similar: during data collection, an error was made and a value was saved under two different columns for many measurements. Therefore in the dataset it translates in columns that are exactly identical.

In [18]:
dropped_features = []
correlated_lists,identical_lists = find_correlation(x_no_missing)

Looking at identical features

In [19]:
identical_lists
Out[19]:
{'miss_cer_dn_1d': {'miss_cer_dn'},
 'miss_cer_up_1d': {'miss_cer_up'},
 'miss_pct_traffic_dmh_up_1d': {'miss_pct_traffic_dmh_up'},
 'miss_pct_traffic_sdmh_up_1d': {'miss_pct_traffic_sdmh_up'},
 'miss_rx_dn_1d': {'miss_rx_dn'},
 'miss_rx_up_1d': {'miss_rx_up'},
 'miss_snr_dn_1d': {'miss_snr_dn'},
 'miss_snr_up_1d': {'miss_snr_up'},
 'miss_tx_up_1d': {'miss_tx_up'}}

The error that seems to have happened is the fact that we have added a miss_*_1d for which we already have a feature called miss_* for all the 9 cpe measurements. Therefore deleting them is non destructive and can be done without much thinking.

The choice of which to drop is purely arbitrary and we will keep the feature with the shortest name as the only condition.

In [20]:
to_drop = []
for k in identical_lists:
    v = list(identical_lists[k])[0]
    drop = v if (len(k) < len(v) and k not in to_drop) else k
    to_drop.append(drop)
dropped_features = dropped_features + to_drop

x_no_missing = x_no_missing.drop(labels=to_drop,axis=1)

Now, we should have deleted all identical features from the dataset and we can run the analysis once more to have the highly correlated features among the remaining ones.

In [21]:
correlated_lists,identical_lists = find_correlation(x_no_missing)
print('We have {} identical feature-pairs'.format(len(identical_lists)))
We have 0 identical feature-pairs

Looking at non identical features

Now that all identical features have been deleted we can look into highly correlated features.

In [22]:
print('We have {} groups of correlated features'.format(len(correlated_lists)))
We have 48 groups of correlated features
In [23]:
correlated_lists
Out[23]:
{'cmts_ms_utilization_up': {'cmts_utilization_up'},
 'cmts_ms_utilization_up_12h': {'cmts_utilization_up_12h'},
 'cmts_ms_utilization_up_18h': {'cmts_utilization_up_18h'},
 'cmts_ms_utilization_up_1d': {'cmts_utilization_up_1d'},
 'cmts_ms_utilization_up_2d': {'cmts_utilization_up_2d'},
 'cmts_ms_utilization_up_3d': {'cmts_utilization_up_3d'},
 'cmts_ms_utilization_up_4d': {'cmts_utilization_up_4d'},
 'cmts_ms_utilization_up_5d': {'cmts_utilization_up_5d'},
 'cmts_ms_utilization_up_6h': {'cmts_utilization_up_6h'},
 'miss_pct_traffic_sdmh_up': {'miss_pct_traffic_dmh_up'},
 'miss_pct_traffic_sdmh_up_12h': {'miss_pct_traffic_dmh_up_12h'},
 'miss_pct_traffic_sdmh_up_18h': {'miss_pct_traffic_dmh_up_18h'},
 'miss_pct_traffic_sdmh_up_24h': {'miss_pct_traffic_dmh_up_24h'},
 'miss_pct_traffic_sdmh_up_2d': {'miss_pct_traffic_dmh_up_2d'},
 'miss_pct_traffic_sdmh_up_3d': {'miss_pct_traffic_dmh_up_3d'},
 'miss_pct_traffic_sdmh_up_4d': {'miss_pct_traffic_dmh_up_4d'},
 'miss_pct_traffic_sdmh_up_5d': {'miss_pct_traffic_dmh_up_5d'},
 'miss_pct_traffic_sdmh_up_6h': {'miss_pct_traffic_dmh_up_6h'},
 'miss_rx_up': {'miss_rx_dn'},
 'miss_rx_up_24h': {'offline_pct_24h'},
 'miss_rx_up_6h': {'offline_pct_6h'},
 'miss_snr_dn': {'miss_rx_dn', 'miss_rx_up'},
 'miss_snr_dn_12h': {'miss_rx_dn_12h'},
 'miss_snr_dn_18h': {'miss_rx_dn_18h'},
 'miss_snr_dn_24h': {'miss_rx_dn_24h'},
 'miss_snr_dn_2d': {'miss_rx_dn_2d'},
 'miss_snr_dn_3d': {'miss_rx_dn_3d'},
 'miss_snr_dn_4d': {'miss_rx_dn_4d'},
 'miss_snr_dn_5d': {'miss_rx_dn_5d'},
 'miss_snr_dn_6h': {'miss_rx_dn_6h'},
 'miss_snr_up': {'miss_rx_dn', 'miss_rx_up', 'miss_snr_dn'},
 'miss_snr_up_12h': {'miss_rx_up_12h'},
 'miss_snr_up_18h': {'miss_rx_up_18h'},
 'miss_snr_up_24h': {'miss_rx_up_24h', 'offline_pct_24h'},
 'miss_snr_up_2d': {'miss_rx_up_2d'},
 'miss_snr_up_3d': {'miss_rx_up_3d'},
 'miss_snr_up_4d': {'miss_rx_up_4d'},
 'miss_snr_up_5d': {'miss_rx_up_5d'},
 'miss_snr_up_6h': {'miss_rx_up_6h', 'offline_pct_6h'},
 'miss_tx_up': {'miss_rx_dn', 'miss_rx_up', 'miss_snr_dn', 'miss_snr_up'},
 'miss_tx_up_12h': {'miss_rx_dn_12h', 'miss_snr_dn_12h'},
 'miss_tx_up_18h': {'miss_rx_dn_18h', 'miss_snr_dn_18h'},
 'miss_tx_up_24h': {'miss_rx_dn_24h', 'miss_snr_dn_24h'},
 'miss_tx_up_2d': {'miss_rx_dn_2d', 'miss_snr_dn_2d'},
 'miss_tx_up_3d': {'miss_rx_dn_3d', 'miss_snr_dn_3d'},
 'miss_tx_up_4d': {'miss_rx_dn_4d', 'miss_snr_dn_4d'},
 'miss_tx_up_5d': {'miss_rx_dn_5d', 'miss_snr_dn_5d'},
 'miss_tx_up_6h': {'miss_rx_dn_6h', 'miss_snr_dn_6h'}}

Looking at this list we can see that their is a pattern that we can first investigate. The way we built the features, we have taken the same measurement aggregated for different time periods: therefore if two measurements are originally highly correlated, each time version should as well.

From the list of correlated features we see that it seems like those underlying variables are identical. One can think of those as a clique of variables:

  • cmts_ms_utilization_up <---> cmts_utilization_up
  • miss_pct_traffic_sdmh_up <---> miss_pct_traffic_dmh_up
  • miss_snr_dn <---> miss_rx_dn <---> miss_tx_up
  • miss_snr_up <---> miss_rx_up

we will first work on those "cliques" such before going further. Indeed these are the correlation that make the most sense as it would be strange that two measurements X and Y are correlated originally but their aggregates over a day isn't.

Cliques of features
In [24]:
suffixes_non_miss = ['','_6h','_12h','_18h','_1d','_2d','_3d','_4d','_5d']
suffixes_miss = ['','_6h','_12h','_18h','_24h','_2d','_3d','_4d','_5d']

The easiest way to perform such exploration would be to create a function that can create multiple pair_plot for each key of the correlated_lists plotting it against each element of the corresponding value-set: this would indeed be very useful in the case of the MISS_ features as they take a finite set of values but also the others as the scatter plot doesn't inform on the concentration of data points (where an hexbin graph would have).

Nevertheless we just want to perform some visual exploration and do not particularly need the population density but rather to see how the joint data points lie w.r.t to the identity line.

In [26]:
compare_candidate_identical('cmts_ms_utilization_up' ,['cmts_utilization_up'],suffixes_non_miss,x_no_missing)

The visual inspection confirms our hypothesis that cmts_ms_utilization_up <---> cmts_utilization_up

In [28]:
compare_candidate_identical('miss_pct_traffic_sdmh_up',['miss_pct_traffic_dmh_up'],suffixes_miss,x_no_missing)

Again a very simple inspection confirms the hypothesis that miss_pct_traffic_sdmh_up <---> miss_pct_traffic_dmh_up

For the third group we will look at:

  1. how similar miss_snr_dn is with each of miss_rx_dn and miss_tx_up
  2. then we will look at how miss_rx_dn and miss_tx_up relate
In [30]:
compare_candidate_identical('miss_snr_dn',['miss_rx_dn','miss_tx_up'],suffixes_miss,x_no_missing)
In [31]:
compare_candidate_identical('miss_rx_dn',['miss_tx_up'],suffixes_miss,x_no_missing)

Pearson correlation always being higher than $0.98$ we are sure that the three variables are indeed a clique in terms of correlation.

In [33]:
compare_candidate_identical('miss_snr_up',['miss_rx_up'],suffixes_miss,x_no_missing)

We have shown that we are indeed in the presence of cliques in each 4 cases (where a connection is set when the two variables are highly correlated) and therefore we need to keep each time only one element per clique

This choice of the feature that we wish to keep, will again be very arbitrary and we do not justify it.

In [34]:
to_drop = ['cmts_ms_utilization_up' + s for s in suffixes_non_miss] + \
                        ['miss_pct_traffic_sdmh_up' + s for s in suffixes_miss] + \
                        ['miss_rx_dn' + s for s in suffixes_miss] + \
                        ['miss_tx_up' + s for s in suffixes_miss] + \
                        ['miss_snr_up' + s for s in suffixes_miss]
dropped_features = dropped_features + to_drop
In [35]:
x_no_missing = x_no_missing.drop(labels=to_drop,axis=1)
In [36]:
x_no_missing.shape
Out[36]:
(10162, 241)
Remaining ones

Now that we have taking out all this cases that unveiled an underlying pattern we can look at particular cases. To do that we run the test again on the remaining DataFrame.

In [38]:
correlated_lists,identical_lists = find_correlation(x_no_missing)
print('We have {} identical feature-pairs'.format(len(identical_lists)))
print('We have {} groups of correlated features'.format(len(correlated_lists)))
We have 0 identical feature-pairs
We have 3 groups of correlated features
In [39]:
correlated_lists
Out[39]:
{'miss_rx_up_24h': {'offline_pct_24h'},
 'miss_rx_up_6h': {'offline_pct_6h'},
 'miss_snr_dn': {'miss_rx_up'}}

Looking at the list of correlated variables we can see that one of the main problems is that this time we do not have each time dimension of the two features matching. Therefore it means that this correlation might hold only on this particular dataset but maybe not on future values.

we will look at these cases individually. (Two plots are generated, one with bins colored according to the log cardinality of points and the other with standard bins in order to have a better idea of the density of clusters in the case of high number of points).

In [42]:
sns.jointplot("miss_rx_up_24h", "offline_pct_24h", data=x_no_missing, kind="hex", joint_kws = {'bins':'log'})
sns.jointplot("miss_rx_up_24h", "offline_pct_24h", data=x_no_missing)
Out[42]:
<seaborn.axisgrid.JointGrid at 0x1a1fc680b8>
In [43]:
sns.jointplot("miss_rx_up_6h", "offline_pct_6h", data=x_no_missing, kind="hex", joint_kws = {'bins':'log'})
sns.jointplot("miss_rx_up_6h", "offline_pct_6h", data=x_no_missing)
Out[43]:
<seaborn.axisgrid.JointGrid at 0x1a1e0e1860>
In [44]:
sns.jointplot("miss_snr_dn", "miss_rx_up", data=x_no_missing, kind="hex", joint_kws = {'bins':'log'})
sns.jointplot("miss_snr_dn", "miss_rx_up", data=x_no_missing)
Out[44]:
<seaborn.axisgrid.JointGrid at 0x11762bc50>

Looking at the plots we can see that the high correlation of those pairs of features reside in the fact that most of the samples are located near the $(0,0)$ point and therefore we decide not to consider them as highly correlated (if they really were we should also see correlation for other time periods - like we saw in previously deleted features).

So we will keep a list of features that we wish to leave out of our analysis for later.

In [46]:
print('By analysing correlated features we have succeeded to get rid of {} features ({:.3f}% reduction).'.format(
    len(dropped_features), 100*len(dropped_features)/x_encoded.shape[1]))
By analysing correlated features we have succeeded to get rid of 54 features (18.305% reduction).

Low variance columns

It is also interesting to look at columns which variance is low. Indeed we wish to investigate predictors that are have both of the following characteristics:

  • very few unique values relative to the number of samples
  • the ratio of the frequency of the most common value to the frequency of the second most common value
In [47]:
columns_to_investigate = nearZeroVar(x_no_missing)
columns_to_investigate
Out[47]:
['offline_pct_6h',
 'offline_pct_12h',
 'offline_pct_18h',
 'offline_pct_24h',
 'offline_pct',
 'offline_pct_1d',
 'offline_pct_2d',
 'offline_pct_3d',
 'offline_pct_4d',
 'offline_pct_5d',
 'miss_cer_dn_6h',
 'miss_cer_dn_12h',
 'miss_cer_dn_24h',
 'pct_traffic_dmh_up_18h',
 'pct_traffic_dmh_up',
 'pct_traffic_sdmh_up_18h',
 'pct_traffic_sdmh_up',
 'pct_traffic_sdmh_up_1d',
 'pct_traffic_sdmh_up_2d',
 'pct_traffic_sdmh_up_3d',
 'miss_rx_up_6h',
 'miss_rx_up_12h',
 'miss_rx_up_24h',
 'model_1',
 'model_2',
 'model_3',
 'model_5',
 'model_6']

As a first observation we can say that all these columns are discrete (since they are mostly percentages out of 6 measurements). Then there is also the model_* features that are One hot encoding and for which we therefore expect a low variance.

We will focus on the features starting by pct_traffic_* as they are indeed percentages but we are not sure in what domain are they expressed.

In [48]:
sns.distplot(x_no_missing['pct_traffic_dmh_up'], bins=20, kde=False, rug=True);
In [58]:
counts = x_no_missing['pct_traffic_dmh_up'].value_counts()
total = counts.sum()
(100*counts/total).head(4)
Out[58]:
 0.000000    90.917142
-0.033182     0.059043
-0.001437     0.049203
-0.002522     0.049203
Name: pct_traffic_dmh_up, dtype: float64

Once we are looking into the distribution of the variable we are quite sceptic regarding how deleting this feature would make sense so we decide to leave it for a later feature selection round (once we have a classifier we csan use it to discard the features that are not very meaningful by ranking them).

We haven't found any conclusive results regarding low variance features

Data preprocessing

Before the data can be used to perform any machine learning task we need to modify a bit its form. And for this we will describe our pipeline

  1. (optional) Map the Milestone to usable labels (for binary classification we need to have only 0s and 1s)
  2. One-Hot encoding of the hardware model (it is a categorical variable)
  3. Handling Missing Values (It happens quite a lot that measurement are missing and we need a strategy to handle such cases).
  4. Scaling the data (such that dimensions are comparable, some models won't work if this isn't done)
In [68]:
y = df['milestone_name']
x = df[feature_cols]

Label Binarization

We will convert the milestone name because we want to interpret it as :

  • if the milestone name is NaN then it means that the CPE is supposed to be healthy (0)
  • otherwise it is considered as sick (1)
In [62]:
y_binary = y.isnull().map(lambda x: 0 if x else 1)

Has been implemented in preprocessing.convert_to_binary_labels()

One-hot encoding

As per the Wikipedia definition

one-hot is a group of bits among which the legal combinations of values are only those with a single high (1) bit and all the others low (0)

In ML, one hot encoding is often used when we are in the presence of a categorical feature. The amplitude of the feature should not be used by the regressor but rather the value it takes. For example, if we consider the feature that was constructed during the import of the data: weekday

In [64]:
print('Different types :\t',set(x.dtypes.values))
print('Discrete features :\t',list(x.dtypes[x.dtypes == 'int64'].index))
Different types :	 {dtype('int64'), dtype('float64')}
Discrete features :	 ['hardware_model', 'n_cpe_building']
In [71]:
x['weekday'].value_counts().sort_index()
Out[71]:
0    1358
1    1343
2     709
3     611
4    1177
5     571
6    1337
Name: weekday, dtype: int64

It takes values ranging from 0 to 6 as there are only 7 days in a week. The classifier should not take into consideration the value of this feature given a weight (which would mean that a high value of weekday may give you an indicator of failure or not). This is why we construct 6 new features: wk_{0,...6} such that wk_i$ = 1$ iff the weekday is 1.

Hence the classifier can give a weight to the fact that the day is Tuesday.

We do this for both the hardware_model and the weekday as they are the only two categorical features. This has been implemented in preprocessing.encode_categorical

Handling missing values

We can see that NULL values are quite present in our dataset and we need a strategy to handle them. Many imputation techniques exists. They are often use to hide those missing values (replacing them by the mean of the feature for example or by a 0). Here we wish our missing values to be meaningful as they could really indicate a problem with this particular CPE, this scenario is often referred to as Missing Not at Random (MNAR).

Nevertheless we have already added to each feature a corresponding MISS_ feature that gives us an indication on the percentage of samples that are missing for the feature computation (because usually those features are aggregates of multiple values).

Therefore we wish to hide as much as possible irregularities of such missing value and will use the mean replacement.

Because we replace by the mean, we will need to always use the mean of the Training set in order to impute values both in the training and testing set. scikit-learn proposes a neat tool to perform this operation everytime we fit and predict (using a pipeline that takes all the steps necessary to transform the data as well as learners)

In [75]:
# the imputer module already provides a way to easily replace missing values by the mean of a particular column.
mean_null_imputer = sklearn.preprocessing.Imputer(strategy='mean')

Scaling of the data

For now the scaling of the data will be fairly arbitrary. We may at a later time compare different scaling on a given model as it might alter performance (some classifiers like Gradient Boosting do not always perform better when scaling the features, while others like a Linear Regression would perform poorly with scaling).

We have three common choices for normalizing the data:

  • log_scaling: good for heavy tailed features $x_i' = \log(x_i)$
  • min-max scaling: may scale to small intervals typical values in case of outliers $x_i' = \frac{x_i-min}{max-min}$
  • standardization: very meaningful in case of normally distributed features $x_i' = \frac{x_i-\mu_i}{\sigma_i}$. Which is the one we chose as the EDA (exploratory data analysis) revealed that most features had approximately normal distribution.
In [77]:
# Again sklearn has this transformer already built in and can be used with pipelines
standardizer = sklearn.preprocessing.StandardScaler()

Dimensionality Reduction

The high dimensionality of the dataset we created has already been mentioned above. This is why we would like to keep features that encapsulates the most information and we will try to look at different tools available to perform such analysis.

Principal component analysis (PCA)

If we look at the definition from our favorite collaborative encyclopedia:

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components.

Therefore for it will compute a new basis for each feature to be represented such that the new 'transformed' features are linearly uncorrelated.

Then there exist a trade-off between the dimensionality of this new basis (the number of components that are considered in the resulting space) and the explained variance of the dataset that will remain in this new representation (compression will result in information loss as one could expect)

Dimensionality reduction usually happens after preprocessing as the last step and therefore we first need to pre-process the data. We will import the data using the Preprocessing functions we have designed to handle all the steps that have been discussed

In [93]:
s_day = '27_04'
DATA_PATH = DATA_FOLDER + '/sample_'+s_day+'.xlsx'
BACKUP_PATH = DATA_FOLDER+'/sample_'+s_day+'.pk'

# We use the help function in order to obtain the dataframe in a correct format.
extracted = import_sample(DATA_PATH,BACKUP_PATH)

# Extract the raw data
x_extracted, y_extracted = get_ml_data(extracted)

# Perform binarization of labels and encoding of categorical values 
y = convert_to_binary_labels(y_extracted)
x_encoded = encode_categorical(x_extracted)

# remove correlated features
x_df = remove_features(x_encoded,verbose = True)
x = x_df.values
Retrieving from ./Data/sample_27_04_sick_only.pk
The sample is composed of : 7106 vectors of dimension 291
	n_sick		=   1164
	n_healthy	=   5942
Deleting 54 features (18.305%).
In [102]:
# preprocessing pipeline construction, impute null values and scale inputs
preproc = make_pipeline(
    sklearn.preprocessing.Imputer(strategy='mean'),
    sklearn.preprocessing.StandardScaler())
X = preproc.fit_transform(x)
In [103]:
# We want to keep the features that allow us to explain 95% of the variance.
pca = PCA()
pca.fit(X)
plt.plot(np.cumsum(pca.explained_variance_ratio_))
plt.xlabel('number of components')
plt.ylabel('cumulative explained variance');

The curve isn't very steep which isn't quite a good news for us given that it forces us to use a lot of components in order to encapsulate a good part of the variance.

Thanks to such tool we can look at our dataset in reduced dimensional spaces: 2D and 3D. It would be interesting to see if we can distinguish between Healthy and Sick CPEs graphically in such space:

In [104]:
x_2D = PCA(2).fit_transform(X)
x_2D_healthy = x_2D[y == 0]
x_2D_sick = x_2D[y == 1]
fig, axes = plt.subplots(nrows=1, ncols=2, sharex = True, sharey = True,figsize=(20, 5))
axes[0].hexbin(x_2D_healthy[:,0],x_2D_healthy[:,1],bins='log',mincnt=1)
axes[0].set_title('Healthy CPE vectors')
axes[1].hexbin(x_2D_sick[:,0],x_2D_sick[:,1],bins='log',mincnt=1)
axes[1].set_title('Sick CPE vectors')
fig.suptitle('2D representation of CPE vectors (PCA red.)')
plt.show()

Visually the difference between the two population isn't evident.

In [97]:
x_3D = PCA(3).fit_transform(X)
threedee = plt.figure().gca(projection='3d')
threedee.scatter(x_3D[:,0], x_3D[:,1], x_3D[:,2])
Out[97]:
<mpl_toolkits.mplot3d.art3d.Path3DCollection at 0x1a3df2add8>

3D representation also doesn't seem to showing much information.

Linear discriminant Analysis (LDA)

Another technique that can be used to perform dimensionality reduction would be to use LDA.

It is a method used in statistics, pattern recognition and machine learning to find a linear combination of features that characterizes or separates two or more classes of objects or events

With such technique the use of standardization isn't necessary. Also because it attempts to separate the two classes, it is a supervised method (it uses the labels to perform the transformation)

In [107]:
preproc_no_scale = make_pipeline(
    sklearn.preprocessing.Imputer(strategy='mean'))
X_prime = preproc_no_scale.fit_transform(x)
In [108]:
LDA = sklearn.discriminant_analysis.LinearDiscriminantAnalysis()
x_prime_1D = LDA.fit_transform(X,y)
In [114]:
x_prime_1D.shape
Out[114]:
(7106, 1)
In [126]:
plotting_df = pd.DataFrame(x_prime_1D,columns=['LDA comp'])
plotting_df['label'] = pd.Series(y).map(lambda x : 'sick' if x == 1 else 'healthy')
sns.boxplot(x= 'label',y="LDA comp", data=plotting_df);

Again it is yet unknown whether such dimensionality reduction will help further classification. Therefore we will try both on our model to see which one performs optimally depending on the model.

Clustering Attempt

Data collection yielded a fairly high level of doubt regarding the way that we label the data for many reasons. Because these reasons will be explained in depth in the thesis we will focus on high-level hints of why our labeling may be wrong:

  • Customer behavior: we are not sure that all users call us when they experience a problem (or they wait so long that they form outlier w.r.t others)
  • Dirty data: as for example
    • UPC doesn't store the piece of hardware that is experiencing problems when the customer calls. But the firm offers a 2-room solution where there are multiple routers at the customer's and therefore we have to mark them all as 'sick' in the case of a call
    • (most of) VIA data comes from THD agents (the agent that answers the call) and can contain many discrepancies. \

Therefore our initial idea was to cluster the data points and use the knowledge of an HFC support expert to determine the labels of each cluster. This is why we started clustering the data.

Before presenting the data to our expert it made sense to ensure that our clustering was indeed discovering an underlying structure in the data as it would be useless to sample from a cluster that is indeed not even clustering data properly

Exploring multiple number of clusters

In order to determine the quality of the clustering (using K-means) for a given number of cluster we will use the a Silhouette analysis as suggested by scikit-learn documentation.

Interpreting these plots resides in two characteristics:

  • we wish to avoid having certain plots that have below average silhouette scores (as indicated by the red line)
  • we wish to have homogeneous size in the silhouette plots (when the classes are supposed to be balanced)

And also an apparently commonly accepted measure is to use the following scale for the silhouette score average:

  1. $[1.0,0.7)$ : A strong structure has been found
  2. $[0.7,0,5)$ : A reasonable structure has been found
  3. $[0.5,0.25)$ : The structure is weak and could be artificial. Try additional methods of data analysis.
  4. $< 0.25$ : No substantial structure has been found
In [163]:
# no dimensionality reduction
scripts.plot.silhouette_analysis(list(range(2,8)),X)
For n_clusters = 2, the average silhouette_score is : 0.2837
For n_clusters = 3, the average silhouette_score is : 0.0235
For n_clusters = 4, the average silhouette_score is : 0.1524
For n_clusters = 5, the average silhouette_score is : 0.0112
For n_clusters = 6, the average silhouette_score is : 0.0120
For n_clusters = 7, the average silhouette_score is : 0.0219

Interpretation of the coefficient values : The best value is 1 and the worst value is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, as a different cluster is more similar.

In [166]:
explained_variance = 0.95
pca = PCA(explained_variance)
pca.fit(X)
dimensions = len(pca.components_)
print("We retained {} dimension(s) to explain {}% of the variance".format(dimensions,100*explained_variance))


# applying a PCA retaining 95% of variance
scripts.plot.silhouette_analysis(list(range(2,8)),pca.fit_transform(X))
We retained 165 dimension(s) to explain 95.0% of the variance
For n_clusters = 2, the average silhouette_score is : 0.4354
For n_clusters = 3, the average silhouette_score is : 0.0928
For n_clusters = 4, the average silhouette_score is : 0.2033
For n_clusters = 5, the average silhouette_score is : 0.0636
For n_clusters = 6, the average silhouette_score is : 0.0160
For n_clusters = 7, the average silhouette_score is : 0.0491

The average silhouette analysis are not so encouraging nevertheless we cannot really be sure that we are not under the curse of dimensionality as we can see that decreasing the number of components increase the silhouette average scores.

We will also try to see whether we can apply the elbow method to find the correct number of clusters (on a much wider range of clusters).

In [169]:
plot_silhouette_score(list(range(2,40)),pca.fit_transform(X))
<matplotlib.figure.Figure at 0x1a2a4646d8>
<matplotlib.figure.Figure at 0x1a27e5a4e0>

From these plots we can observe:

  • the SSE does indeed decrease with the number of clusters but this could be due to the insignificance of distances in such a high dimensional space.
  • the silhouette score indicates that no clustering seems to define a satisfying structure and therefore we need to investigate a bit on the efficiency of our clustering.

Attempt at improving clustering

Balancing classes

One of the assumptions of K-Means is that clusters should have approximately the same size, therefore we try to subsample the 'believed healthy' population.

In [174]:
X_s,y_s = get_balanced_classes(X,y)
scripts.plot.silhouette_analysis(list(range(2,8)),X_s)
For n_clusters = 2, the average silhouette_score is : 0.4065
For n_clusters = 3, the average silhouette_score is : 0.0028
For n_clusters = 4, the average silhouette_score is : 0.0111
For n_clusters = 5, the average silhouette_score is : 0.0439
For n_clusters = 6, the average silhouette_score is : 0.0155
For n_clusters = 7, the average silhouette_score is : 0.0435

In [182]:
K_means = sklearn.cluster.MiniBatchKMeans(n_clusters=2,batch_size=1000)
labels = K_means.fit_predict(X_s)
analyse_clustering(y_s,labels)
Precision = 80.165%	Recall = 8.333%
Out[182]:
Desc Counts Label proportion Cluster proportion
Cluster Labels
healthy 0.0 True negatives 1140 97.938144 51.653829
1.0 False negatives 1067 91.666667 48.346171
sick 0.0 False positives 24 2.061856 19.834711
1.0 True positives 97 8.333333 80.165289

What makes us even more confident about the incorrectness of our clustering is that by running the clustering multiple times doesn't yield the same results and therefore the clustering can be considered as somehow random.

Combining different levels of PCA discrimination with balanced classes

We will compare different levels of retained variance for different clusters with balanced classes. We will be tracking the silhouette score as it seems to give us a fairly good estimate of the quality of the cluster.

In [186]:
plot_silhouette_score_different_PCA(X,[0.3,0.5,0.7,0.9],list(range(2,15)),True, y)

Combining LDA with balanced classes

Supposedly LDA could allow to perform better clustering as it computes a linear combination of the features in order to separate as much as possible the two classes.

In [197]:
LDA = sklearn.discriminant_analysis.LinearDiscriminantAnalysis()
K_means = sklearn.cluster.MiniBatchKMeans(n_clusters=2,batch_size=600)
labels = K_means.fit_predict(LDA.fit_transform(X_s,y_s))
analyse_clustering(y_s,labels)
Precision = 81.667%	Recall = 50.515%
Out[197]:
Desc Counts Label proportion Cluster proportion
Cluster Labels
healthy 0.0 True negatives 1032 88.659794 64.179104
1.0 False negatives 576 49.484536 35.820896
sick 0.0 False positives 132 11.340206 18.333333
1.0 True positives 588 50.515464 81.666667
In [190]:
get_single_silhouette(X_s,labels,2)
Out[190]:
'For n_clusters = 2, the average silhouette_score is : 0.1201'

While the precision and Recall of such clustering are promising we are a bit puzzled by the silhouette score that is alarmingly low which we cannot really explain. And we concluded that LDA could definitely be a great tool for further classification. Nevertheless the fact that the recall is only 50% was alarming as we do not expect 50% of our 'sick' CPEs to be indeed not sick.

Moving on

Given our inability to cluster the data with a meaningful structure we decided to move directly to the design of a classifier. This was done for multiple reasons but mainly because changing the labels based on the results of clustering would not be scientifically correct . We didn't have the possibility to check a significant number of MACs from each cluster to ensure that the clustering was conclusive and therefore would have possibly overestimated the performance of the classifier as we would have introduced an bias that artificially biased our results.

Also we put back in perspective the goal of the project for UPC which was to display the evidence of an underlying structure in the data that would later give birth to a project where we would start modifying the data that is being collected in order to build a proper prediction model.

Failure Prediction

Model Exploration

Because we didn't know what models would fit the best for our task we decided to begin with an extensive exploration of the most famous models to perform a binary classification.

Also in order to compare different models we focused on Receiver Operating Characteristic curve (ROC) of different classifiers. This curve orders predictions based on their score (how sure we are that they belong to the 'sick' group), then for each prediction it will go up if the prediction is correct and right if it isn't. Therefore, the identity line ($x=y$) corresponds to a random classifier.

We are interested in classifier with the steepest curve near zero as it would mean that our top predictions are very accurate (as this would mean that we could order our predictions and then handle only the top $~15\%$). This is in the spirit of our will to show the existence of an underlying structure.

In [3]:
x,y,_ = usable_data('27_04',DATA_FOLDER)
Retrieving from ./Data/sample_27_04_sick_only.pk
The sample is composed of : 7106 vectors of dimension 291
	n_sick		=   1164
	n_healthy	=   5942
Deleting 54 features (18.305%).

Standard cross validation

Our initial assumption was that we could put all the vectors in a dataset regardless of the Day_0 at which they were collected and then cross validate on this dataset. This assumption turned out to be wrong at a later stage but to be complete in our description of our approach we will still present the result of this exploration.

In order to build ROC curves that we can compare and also obtain unbiased estimates of the performance of the classifiers we will compute those two on multiple folds and then obtain the mean and Confidence intervals to determine the stability of the model.

In [19]:
# we said we would try the two preprocessing steps with LDA or PCA
PCA_preproc = make_pipeline(
    sklearn.preprocessing.Imputer(strategy='mean'),
    sklearn.preprocessing.StandardScaler(),
    PCA(0.95))

LDA_preproc = make_pipeline(sklearn.preprocessing.Imputer(
    strategy='mean'), LinearDiscriminantAnalysis())

# We set the ratio of top predictions on which we wish to compute the metrics
TOP_RATIO = 0.2
In [5]:
results = []
create_result_dict = lambda name,prec,rec,F_one,p: {'Method':name, "Precision":prec,"Recall": rec,"F1-Score":F_one, "Parameter":p}

Random Forests

In [7]:
n_est_range = [10,100,1000,5000]
PCA
In [11]:
name,prec,rec,F_one,p = roc_curves(n_est_range,
           lambda param: make_pipeline(
                PCA_preproc,
                RandomForestClassifier(n_estimators = param, random_state = 42,n_jobs=-1)),
           x,y,
           'Random Forests with PCA',
           TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
[############################################################] 100.0% (4/4) Generating subplots
LDA
In [12]:
name,prec,rec,F_one,p  = roc_curves(n_est_range,
           lambda param: make_pipeline(
                LDA_preproc,
                RandomForestClassifier(n_estimators = param, random_state = 42,n_jobs=-1)),
           x,y,
           'Random Forests with LDA',
           TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
[############################################################] 100.0% (4/4) Generating subplots

Linear Support Vector Machines (SVM)

PCA
In [13]:
pen_range = np.linspace(0.1, 4, 4)
name,prec,rec,F_one,p  = roc_curves(pen_range,
           lambda param: make_pipeline(PCA_preproc,
                                       sklearn.calibration.CalibratedClassifierCV(sklearn.svm.LinearSVC(penalty='l1', dual=False, C=param, random_state=42, max_iter=1000))),
           x, y,
           'Linear SVM with PCA',
           TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
[############################################################] 100.0% (4/4) Generating subplots
LDA
In [14]:
pen_range = np.logspace(-3, 0, 4)
name,prec,rec,F_one,p  = roc_curves(pen_range,
           lambda param: make_pipeline(LDA_preproc,
                                       sklearn.calibration.CalibratedClassifierCV(sklearn.svm.LinearSVC(penalty='l1', dual=False, C=param, random_state=42, max_iter=1000))),
           x, y,
           'Linear SVM with LDA',
           TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
[############################################################] 100.0% (4/4) Generating subplots

Non-Linear Support Vector Machines

In [15]:
pen_range = np.logspace(-0.5, 0, 4)
PCA
In [16]:
name,prec,rec,F_one,p = roc_curves(pen_range,
           lambda param: make_pipeline(
                PCA_preproc,
                sklearn.calibration.CalibratedClassifierCV(sklearn.svm.SVC(C=param))),
            x, y,
            'Non Linear SVM with PCA',
            TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
[############################################################] 100.0% (4/4) Generating subplots
LDA
In [17]:
name,prec,rec,F_one,p = roc_curves(pen_range,
           lambda param: make_pipeline(
                LDA_preproc,
                sklearn.calibration.CalibratedClassifierCV(sklearn.svm.SVC(C=param))),
            x, y,
            'Non Linear SVM with PCA',
            TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
[############################################################] 100.0% (4/4) Generating subplots

Gradient Boosting

PCA
In [18]:
name,prec,rec,F_one,p = roc_curves([10,100,500,1000],
           lambda param: make_pipeline(
               PCA_preproc,
               sklearn.ensemble.GradientBoostingClassifier(n_estimators=param)),
           x,y,
           'Gradient Boosting with PCA',
           TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
[############################################################] 100.0% (4/4) Generating subplots
LDA
In [19]:
name,prec,rec,F_one,p = roc_curves([10,100,500,1000],
           lambda param: make_pipeline(
               LDA_preproc,
               sklearn.ensemble.GradientBoostingClassifier(n_estimators=param)),
           x,y,
           'Gradient Boosting with LDA',
           TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
[############################################################] 100.0% (4/4) Generating subplots

k - Nearest Neighbors (kNN)

PCA
In [20]:
name,prec,rec,F_one,p = roc_curves(np.linspace(1,100,20),
           lambda param: make_pipeline(
               PCA_preproc,
               sklearn.neighbors.KNeighborsClassifier(n_neighbors=int(param))),
           x,y,
           'Gradient Boosting with PCA',
           TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
[############################################################] 100.0% (20/20) Generating subplots
LDA
In [21]:
name,prec,rec,F_one,p = roc_curves(np.linspace(1,100,20),
           lambda param: make_pipeline(
               LDA_preproc,
               sklearn.neighbors.KNeighborsClassifier(n_neighbors=int(param))),
           x,y,
           'Gradient Boosting with LDA',
           TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
[############################################################] 100.0% (20/20) Generating subplots

Logistic Regression

PCA
In [22]:
name,prec,rec,F_one,p = roc_curves(np.logspace(-3,2,10),
           lambda param: make_pipeline(
               PCA_preproc,
               sklearn.linear_model.LogisticRegression(C=param, max_iter=500,n_jobs=-1)),
           x,y,
           'Logistic Regression with PCA',
           TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
[############################################################] 100.0% (10/10) Generating subplots
LDA
In [23]:
name,prec,rec,F_one,p = roc_curves(np.logspace(-3,0,4),
           lambda param: make_pipeline(
               PCA_preproc,
               sklearn.linear_model.LogisticRegression(C=param, max_iter=500,n_jobs=-1)),
           x,y,
           'Logistic Regression with PCA',
           TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
[############################################################] 100.0% (4/4) Generating subplots

Ridge Regression

PCA
In [24]:
name,prec,rec,F_one,p = roc_curves(np.logspace(-3,2,10),
           lambda param: make_pipeline(
               PCA_preproc,
               sklearn.calibration.CalibratedClassifierCV(sklearn.linear_model.RidgeClassifier(alpha=param))),
           x,y,
           'Ridge Regression with PCA',
           TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
[############################################################] 100.0% (10/10) Generating subplots
LDA
In [25]:
name,prec,rec,F_one,p = roc_curves(np.logspace(-3,2,10),
           lambda param: make_pipeline(
               LDA_preproc,
               sklearn.calibration.CalibratedClassifierCV(sklearn.linear_model.RidgeClassifier(alpha=param))),
           x,y,
           'Ridge Regression + LDA',
           TOP_RATIO)
results.append(create_result_dict(name,prec,rec,F_one,p))
[############################################################] 100.0% (10/10) Generating subplots
In [27]:
results_df = pd.DataFrame(results)
results_df.sort_values(by='F1-Score',ascending=False)
Out[27]:
F1-Score Method Parameter Precision Recall
0 91.704730 Random Forests with PCA 5000.000000 84.731183 100.000000
4 89.387325 Non Linear SVM with PCA 1.000000 80.860215 100.000000
6 89.282233 Gradient Boosting with PCA 500.000000 80.860215 100.000000
9 88.128110 Gradient Boosting with LDA 27.052632 78.924731 100.000000
11 87.946957 Logistic Regression with PCA 1.000000 78.494624 100.000000
3 87.864028 Linear SVM with LDA 0.010000 78.494624 100.000000
13 86.868263 Ridge Regression + LDA 7.742637 76.989247 100.000000
12 86.694089 Ridge Regression with PCA 0.003594 76.559140 100.000000
7 86.280768 Gradient Boosting with LDA 10.000000 76.129032 100.000000
10 86.150735 Logistic Regression with PCA 0.046416 75.698925 100.000000
2 85.877447 Linear SVM with PCA 0.100000 75.268817 100.000000
8 83.698087 Gradient Boosting with PCA 37.473684 72.197288 99.701493
5 83.049292 Non Linear SVM with PCA 0.681292 71.182796 100.000000
1 81.912445 Random Forests with LDA 1000.000000 69.462366 100.000000

Looking at theses results we decide to focus on three different models that seemed to be the most promising for our prediction task:

  • Random forests with PCA
  • Non-Linear SVM with PCA
  • Gradient Boosting with PCA

Hold out days between training and testing

Before moving forward we were interested in the robustness of the model regarding the number of days between training and testing. That would tell us for how long can we use the model before retraining it.

To work on this test we decided to take the most recent sample of CPEs we would have in order to have more sick CPEs examples and therefore to leverage as much as possible the data that was being collected during this period.

In [29]:
DATA_FOLDER = './Data'
s_day = '16_05'
DATA_PATH = DATA_FOLDER + '/sample_'+s_day+'.xlsx'
BACKUP_PATH = DATA_FOLDER+'/sample_'+s_day+'.pk'


extracted = import_sample(DATA_PATH,BACKUP_PATH)
x_extracted, y_extracted = get_ml_data(extracted)

y = convert_to_binary_labels(y_extracted)
x_encoded = encode_categorical(x_extracted)
x = remove_features(x_encoded,verbose = True).values
Retrieving from ./Data/sample_16_05_sick_only.pk
The sample is composed of : 10972 vectors of dimension 291
	n_sick		=   3057
	n_healthy	=   7915
Deleting 54 features (18.305%).

In order to do that we will create a training dataset and an hold-out set. To build this Hold-out set we will keep 3 days of data in the hold out and will look at the accuracy on this hold out set for different gaps between training and testing.

In [30]:
def get_performance_gap_hold_out(extracted,ho_n_days,gap_n_days,clf_pipeline,top_ratio = 0.2):
    """
    Allow us to compute naively classification performance metrics on a hold out set 
    that is gap_n_days away from the training dataset days.
    
    @params:
    - extracted: unreprocessed dataset obtained from import_sample 
    - ho_n_days: the number of days we wish to have in our hold out set
    - gap_n_days: the number of days of difference between the training dates and testing dates
    - clf_pipeline: an instantiated classification pipeline
    - top_ratio: the ratio of top predictions we wish to use in order to test our model
    
    NB: the function makes the assumption that the dates are consecutive inside the sample.
    """
    # Compute the dates we wish to work with
    dates = extracted['day_0'].unique()
    dates.sort()
    ho_days = dates[-(1+ho_n_days):-1]
    dataset_days = dates[:len(dates)-(ho_n_days+gap_n_days+1)]

    # get the two datasets
    dataset = extracted[extracted['day_0'].isin(dataset_days)]
    hold_out = extracted[extracted['day_0'].isin(ho_days)]
    print('Hold Out population = {}\nTraining Population = {}'.format(len(hold_out),len(dataset)))

    # prepare these datasets
    x_extracted, y_extracted = get_ml_data(dataset)
    y = convert_to_binary_labels(y_extracted)
    x_encoded = encode_categorical(x_extracted)
    x = remove_features(x_encoded,verbose = False).values
    x_s,y_s = get_balanced_classes(x,y)
    clf_pipeline.fit(x_s,y_s)

    x_extracted_ho, y_extracted_ho = get_ml_data(hold_out)
    y_ho = convert_to_binary_labels(y_extracted_ho)
    x_encoded_ho = encode_categorical(x_extracted_ho)
    x_ho = remove_features(x_encoded_ho,verbose = False).values
    x_s_ho,y_s_ho = get_balanced_classes(x_ho,y_ho)
    
    preds = clf_pipeline.predict_proba(x_s_ho)
    return get_metrics(preds,y_s_ho,top_ratio)
In [35]:
clf = make_pipeline(
    PCA_preproc,
    RandomForestClassifier(n_estimators = 5000, random_state = 42,n_jobs=-1))
P,R,F1 = get_performance_gap_hold_out(extracted,5,0,clf)
Hold Out population = 2004
Training Population = 8654
In [36]:
print('With {} days of hold out between the training and testing and with 2 days in the testing set:\nP = {:.2f}%\tR = {:.2f}%\tF1 = {:.2f}%'.format(2,P,R,F1))
With 2 days of hold out between the training and testing and with 2 days in the testing set:
P = 51.25%	R = 100.00%	F1 = 67.77%

This result was quite surprising. We realized that we may have been wrong by assuming that we could cross validate on the dataset composed of all vectors regardless of their day_0. Even though this method doesn't perform a cross validation we can see that the performance were clearly overly optimistic and that we need to change the way we evaluate the performance of our classifier.

Regarding the possible causes of this drop in performance we could argue that the model may be overfitting particular dates and therefore when such dates are both in the testing and training the performance get abnormally high.

K-Fold on distinct days

Hence we will now use a K-Fold that will consider distinct days in each fold and therefore will always train test on distinct sets of dates which will ensure that we do not over estimate the performance of the classifier in the real use case (if this predictions must be performed on a daily basis it seems trivial to understand that we will not have access to the calls of the day on which we are trying to predict and therefore the vectors for this particular day_0 won't be in the training set).

Again we use the most recent sample that is available to us.

In [3]:
x,y,dates = usable_data('22_05',DATA_FOLDER)
Retrieving from ./Data/sample_22_05_sick_only.pk
The sample is composed of : 14330 vectors of dimension 291
	n_sick		=   3733
	n_healthy	=  10597
Deleting 54 features (18.305%).
In [39]:
distinct_date_split(x,y,dates,k=2)
Out[39]:
[(array([13505,  9913, 14070, ...,  8259,  5309,  2112]),
  array([10712,   800,  4562, ...,  9445,   398, 11966])),
 (array([3621, 8653, 2792, ..., 3788, 5693,  560]),
  array([ 1730,   176,  9993, ...,  6320,  6197, 10157]))]

We also saw that training on a balanced dataset improved the performance of a clustering algorithm and it was therefore thought that it would be better to also balance somehow our training sets. We also thought that to be similar to a regular cross validation it would desirable to shuffle the two sets therefore we shuffled the dates among the different folds.

To be able to perform a comparison between this new cross validation and the other we have added in all the evaluation methods used previously the possibility to use this new cross validation.

Comparing it to the old technique

In [41]:
# Old method
roc_curves([1000],
           lambda param: make_pipeline(
                PCA_preproc,
                RandomForestClassifier(n_estimators = param, random_state = 42,n_jobs=-1)),
           x,y,
           'PCA + standard scaling + RF',
           TOP_RATIO,dates=dates,distinct_date=False)
[############################################################] 100.0% (1/1) Generating subplots
Out[41]:
('PCA + standard scaling + RF',
 87.1571906354515,
 100.0,
 93.12896368287052,
 1000)
In [42]:
# New method
roc_curves([1000],
           lambda param: make_pipeline(
                PCA_preproc,
                RandomForestClassifier(n_estimators = param, random_state = 42,n_jobs=-1)),
           x,y,
           'PCA + standard scaling + RF',
           TOP_RATIO,dates=dates,distinct_date=True)
[############################################################] 100.0% (1/1) Generating subplots
Out[42]:
('PCA + standard scaling + RF',
 48.95450752817972,
 100.0,
 65.70357477019249,
 1000)

We can see that even now that we have a cross validated measurement the drop in performance remains important. Also it might seem very useless to have a precision of less than $50\%$ (as we thought at first) but the thing is that we are talking about the Precision on the sick class and not the accuracy of our classifier.

Let's put things in perspective:

In [43]:
# Let's look at the number of sick CPEs per day
df = pd.DataFrame()
df['date'] = dates
df['sick'] = y
df.groupby('date').sum().mean()
Out[43]:
sick    103.694444
dtype: float64

The previous analysis shows that we have on average 103 sick routers every day. One should keep in mind that we are working on subsample of the dataset that contains all the sick routers but only a small fraction of the healthy ones. From our SQL data collection there are approximately 500k distinct routers that we collect every day.

Therefore if an operator would have to check randomly a router from the full network he would have a probability of $\approx \frac{100}{500000} = 0.02\%$ to find a sick CPE. If he checks from the subset of routers we declare as being sick he now has a probability of $\approx 49\%$ to find a sick one.

That is, to our belief a significant improvement and we have decided to keep exploring the models using such technique in order to see what performance we could achieve on top predictions.

Changing the way we visualize performance: cutoff probability

for now we are simply considering our top $20\%$ predictions based on their probability to belong to the sick class and we would like to have a cutoff_proba instead to select only the predictions that we are the most sure about.

Therefore we will look at the evolution of the precision and recall for different models such that we can encapsulate the trade-off between the two. Either we catch more sick CPEs but have to drop the precision of our prediction (="the number of CPEs that we declare as being sick and that are indeed sick") or we will catch less but will be (more) sure about the ones we declare as being sick.

In order to evaluate our models we will look at the highest level of recall that can be achieved for a desired precision. This will in vulgarized terms tell us the proportion of CPEs that we can achieve depending on how sure we wish to be of our prediction.

We have to use such evaluation as the true precision-recall curve cannot be approximated in an unbiased manner using a cross validation.

In [62]:
n_est_range = [10,20]
pipeline = lambda param: make_pipeline(
                PCA_preproc,
                RandomForestClassifier(n_estimators = param, random_state = 42,n_jobs=-1))
precision_recall_curves(n_est_range,pipeline,x,y,'PCA RF',[0.8,0.9],dates)
Out[62]:
['PCA RF', 0.04771053018875256, 0.014791742246364082, 0.4145738337250578, 20]

On these subplots the first line gives the average recall for different precision levels. E.g. for a random forest with 20 trees:

  • for P = 80%, the max recall achievable is 1.22% with a cutoff probability of 0.93
  • for P = 90%, it is 0.36% with a cutoff probability of 0.96

Deciding on the level of precision was done in accordance with the Business Intelligence team that determined that those would be more helpful to perform business decisions later on.

New Cross validation

We can do the same kind of exploration than the one that we performed using the old technique, but this time we will have an trustworthy estimate of the performance of the classifier.

In [ ]:
precision_levels = [0.7,0.8,0.9]
In [14]:
results = []
create_result_dict = lambda name,seventy,eighty,ninety,auc,p: {'Method':name, 
                                                               "Recall for 70% Acc":seventy,
                                                               "Recall for 80% Acc": eighty,
                                                               "Recall for 90% Acc":ninety,
                                                               "AUC":auc, 
                                                               "Parameter":p}                                            

As it will be explained in the Understanding Classification section we have realized that dimensionality reduction might actually be hurting the performance of classifiers.

In [15]:
simple_preproc = make_pipeline(
    sklearn.preprocessing.Imputer(strategy='mean'),
    sklearn.preprocessing.StandardScaler())

Random Forests

No dimensionality reduction
In [17]:
params = [50,100,1000,5000]
pipeline = lambda param: make_pipeline(
                simple_preproc,
                RandomForestClassifier(n_estimators = param, random_state = 42,n_jobs=-1))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Random Forests',precision_levels,dates)

results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (4/4) Generating subplots
PCA
In [20]:
params = [100,1000,5000]
pipeline = lambda param: make_pipeline(
                PCA_preproc,
                RandomForestClassifier(n_estimators = param, random_state = 42,n_jobs=-1))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Random Forests with PCA',precision_levels,dates)

results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (3/3) Generating subplots
LDA
In [21]:
params = [100,1000,5000]
pipeline = lambda param: make_pipeline(
                LDA_preproc,
                RandomForestClassifier(n_estimators = param, random_state = 42,n_jobs=-1))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Random Forests with LDA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (3/3) Generating subplots

Linear SVM

No dimensionality reduction
In [26]:
params = np.logspace(-3,0,10)
pipeline = lambda param: make_pipeline(simple_preproc,
                                       sklearn.calibration.CalibratedClassifierCV(sklearn.svm.LinearSVC(penalty='l1',dual=False,C=param,random_state=42, max_iter=1000)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Linear SVM',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (10/10) Generating subplots
PCA
In [27]:
pipeline = lambda param: make_pipeline(PCA_preproc,
                                       sklearn.calibration.CalibratedClassifierCV(sklearn.svm.LinearSVC(penalty='l1',dual=False,C=param,random_state=42, max_iter=1000)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Linear SVM with PCA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (10/10) Generating subplots
LDA
In [28]:
pipeline = lambda param: make_pipeline(LDA_preproc,
                                       sklearn.calibration.CalibratedClassifierCV(sklearn.svm.LinearSVC(penalty='l1',dual=False,C=param,random_state=42, max_iter=1000)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Linear SVM with LDA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (10/10) Generating subplots

Non-Linear SVM

No dimensionality reduction
In [29]:
params = np.logspace(-1,0,4)
pipeline = lambda param: make_pipeline(
                            simple_preproc,
                            sklearn.calibration.CalibratedClassifierCV(sklearn.svm.SVC(C=param)))
name,seventy,eighty,ninety,AUC,p =  precision_recall_curves(params,pipeline,x,y,'Non Linear SVM',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (4/4) Generating subplots
PCA
In [30]:
pipeline = lambda param: make_pipeline(
                            PCA_preproc,
                            sklearn.calibration.CalibratedClassifierCV(sklearn.svm.SVC(C=param)))
name,seventy,eighty,ninety,AUC,p =  precision_recall_curves(params,pipeline,x,y,'Non Linear SVM with PCA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (4/4) Generating subplots
LDA
In [31]:
pipeline = lambda param: make_pipeline(
                            LDA_preproc,
                            sklearn.calibration.CalibratedClassifierCV(sklearn.svm.SVC(C=param)))
name,seventy,eighty,ninety,AUC,p =  precision_recall_curves(params,pipeline,x,y,'Non Linear SVM with LDA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (4/4) Generating subplots

Gradient Boosting

No dimensionality reduction
In [32]:
params = [50,100,1000,5000]
pipeline = lambda param: make_pipeline(simple_preproc,sklearn.ensemble.GradientBoostingClassifier(n_estimators=param))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Gradient Boosting',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (4/4) Generating subplots
PCA
In [33]:
params = [500,1000,5000]
pipeline = lambda param: make_pipeline(PCA_preproc,sklearn.ensemble.GradientBoostingClassifier(n_estimators=param))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Gradient Boosting with PCA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (3/3) Generating subplots
LDA
In [41]:
params = [500,1000,5000]
pipeline = lambda param: make_pipeline(LDA_preproc,sklearn.ensemble.GradientBoostingClassifier(n_estimators=param))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Gradient Boosting with LDA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (3/3) Generating subplots

kNN

No dimensionality reduction
In [42]:
params = np.linspace(1,100,10)
pipeline = lambda param: make_pipeline(simple_preproc,sklearn.neighbors.KNeighborsClassifier(n_neighbors=int(param)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'kNN',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (10/10) Generating subplots
PCA
In [43]:
pipeline = lambda param: make_pipeline(PCA_preproc,sklearn.neighbors.KNeighborsClassifier(n_neighbors=int(param)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'kNN with PCA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (10/10) Generating subplots
LDA
In [44]:
pipeline = lambda param: make_pipeline(LDA_preproc,sklearn.neighbors.KNeighborsClassifier(n_neighbors=int(param)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'kNN with LDA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (10/10) Generating subplots

Logistic Regression

No dimensionality reduction
In [45]:
params = np.logspace(-3,2,20)
pipeline = lambda param: make_pipeline(
                simple_preproc,
                sklearn.linear_model.LogisticRegression(C=param, max_iter=500,n_jobs=-1))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Logistic Regression',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (20/20) Generating subplots
PCA
In [46]:
pipeline = lambda param: make_pipeline(
                PCA_preproc,
                sklearn.linear_model.LogisticRegression(C=param, max_iter=500,n_jobs=-1))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Logistic Regression with PCA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (20/20) Generating subplots
LDA
In [47]:
pipeline = lambda param: make_pipeline(
                LDA_preproc,
                sklearn.linear_model.LogisticRegression(C=param, max_iter=500,n_jobs=-1))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Logistic Regression with LDA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (20/20) Generating subplots

Ridge Regression

No dimensionality reduction
In [64]:
params = np.logspace(-3,2,10)
pipeline = lambda param: make_pipeline(
                            simple_preproc,
                            sklearn.calibration.CalibratedClassifierCV(sklearn.linear_model.RidgeClassifier(alpha=param)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Ridge Regression',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (10/10) Generating subplots
PCA
In [65]:
pipeline = lambda param: make_pipeline(
                            PCA_preproc,
                            sklearn.calibration.CalibratedClassifierCV(sklearn.linear_model.RidgeClassifier(alpha=param)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Ridge Regression with PCA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (10/10) Generating subplots
LDA
In [66]:
pipeline = lambda param: make_pipeline(
                            LDA_preproc,
                            sklearn.calibration.CalibratedClassifierCV(sklearn.linear_model.RidgeClassifier(alpha=param)))
name,seventy,eighty,ninety,AUC,p = precision_recall_curves(params,pipeline,x,y,'Ridge Regression with LDA',precision_levels,dates)
results.append(create_result_dict(name,seventy,eighty,ninety,AUC,p))
[############################################################] 100.0% (10/10) Generating subplots
In [67]:
results_df = pd.DataFrame(results)
results_df.sort_values(by='Recall for 90% Acc',ascending=False)
Out[67]:
AUC Method Parameter Recall for 70% Acc Recall for 80% Acc Recall for 90% Acc
9 0.528025 Gradient Boosting 50.000000 0.252317 0.185119 0.104765
0 0.517642 Random Forests 5000.000000 0.240582 0.168757 0.053941
6 0.482226 Non Linear SVM 1.000000 0.185460 0.104001 0.043300
7 0.470795 Non Linear SVM with PCA 0.464159 0.161119 0.081118 0.040773
4 0.466997 Linear SVM with PCA 0.215443 0.175715 0.108385 0.039879
16 0.472952 Logistic Regression with PCA 0.003360 0.203834 0.116751 0.039143
1 0.475284 Random Forests with PCA 5000.000000 0.180623 0.093441 0.034864
15 0.477251 Logistic Regression 0.011288 0.203331 0.145606 0.034528
19 0.459569 Ridge Regression with PCA 27.825594 0.189331 0.086615 0.031486
18 0.462197 Ridge Regression 100.000000 0.192833 0.099034 0.024675
3 0.474212 Linear SVM 0.004642 0.219301 0.106904 0.024345
10 0.467468 Gradient Boosting with PCA 500.000000 0.172155 0.101335 0.018533
20 0.464163 Ridge Regression with LDA 0.001000 0.193225 0.115953 0.016489
5 0.456952 Linear SVM with LDA 0.001000 0.166419 0.113650 0.012468
17 0.461781 Logistic Regression with LDA 0.001833 0.187344 0.110134 0.010464
11 0.435256 Gradient Boosting with LDA 500.000000 0.152451 0.055144 0.006874
8 0.394185 Non Linear SVM with LDA 0.100000 0.001510 0.001510 0.001510
12 0.451739 kNN 1.000000 0.000000 0.000000 0.000000
13 0.458059 kNN with PCA 1.000000 0.000000 0.000000 0.000000
14 0.493456 kNN with LDA 1.000000 0.000000 0.000000 0.000000
2 0.414804 Random Forests with LDA 1000.000000 0.000000 0.000000 0.000000

On the previous plots it is important to note that the darker curve isn't the mean of the the curve for each fold and this is due to the difficulty to perform interpolation in the PR space. There exist another technique with is called score concatenation, it proposes here to accumulate all the y_pred and all the y_real of each fold and to then plot the curve by considering the accumulates as the real dataset of predicted and real target. This allow us to display some characteristics that we wish to use to compare different models:

  • We would ideally like a non increasing curve (otherwise we wouldn't be more confident of predictions with a high probability of belonging to the 'sick' class.
  • Also we would like to have a larger area under the curve on the leftmost part of the curve (which will also mean that we have a higher precision per recall)

Understanding Classification

No dimensionality reduction improves score

Before we could go further we decided in accordance with the Business Intelligence team to perform some checks in order to understand what are the features that carry the most importance for the best classifier. That would allow us to perform a reality check with the HFC support team.

Because many classifiers have similar level of performance (we are mostly looking at the Recall for 90% Accuracy) we decided to use the top one Gradient Boosting also because it has a built in function to get the feature importance.

Because we were interested in the initial components and not the PCA components importance we tried a couple of things including looking at the feature importance without applying an initial PCA. That showed us that without PCA the model was performing much better. :

  • need less tree (and therefore a shorter training time)
  • yields higher accuracy and recall

We realized our mistake of not trying first without any dimensionality reduction during the classification. And we decided to add this preprocessing to our model exploration (for simplicity we have already added that in the Model exploration above and will simply.

Investigating the top 4 features

In [52]:
# we import the data again in order to obtain the features name that are deleted in xb
s_day = '22_05'
DATA_PATH = DATA_FOLDER + '/sample_'+s_day+'.xlsx'
BACKUP_PATH = DATA_FOLDER+'/sample_'+s_day+'.pk'


extracted = import_sample(DATA_PATH,BACKUP_PATH)
dates = extracted['day_0'].values
x_extracted, y_extracted = get_ml_data(extracted)

x_encoded = encode_categorical(x_extracted)
x_df = remove_features(x_encoded,verbose = True)
Retrieving from ./Data/sample_22_05_sick_only.pk
The sample is composed of : 14330 vectors of dimension 291
	n_sick		=   3733
	n_healthy	=  10597
Deleting 54 features (18.305%).
In [54]:
clf = make_pipeline(simple_preproc, 
                    sklearn.ensemble.GradientBoostingClassifier(n_estimators=50))
clf.fit(x,y)
Out[54]:
Pipeline(memory=None,
     steps=[('pipeline', Pipeline(memory=None,
     steps=[('imputer', Imputer(axis=0, copy=True, missing_values='NaN', strategy='mean', verbose=0)), ('standardscaler', StandardScaler(copy=True, with_mean=True, with_std=True))])), ('gradientboostingclassifier', GradientBoostingClassifier(criterion='fried...      presort='auto', random_state=None, subsample=1.0, verbose=0,
              warm_start=False))])
In [100]:
features = list(x_df.columns)
feature_analysis = pd.DataFrame()
feature_analysis['feature'] = features
feature_analysis['importance'] = clf.steps[1][1].feature_importances_
feature_analysis = feature_analysis.sort_values(by='importance',ascending=False)
feature_analysis.head(4)
Out[100]:
feature importance
5 offline_pct 0.086595
110 miss_snr_dn_6h 0.073434
223 cmts_utilization_dn_2d 0.046183
159 cmts_cer_up_1d 0.039347
In [71]:
n_useless = len(feature_analysis[feature_analysis['importance'] == 0])
print('There are {} features that are not used by the classifier'.format(n_useless))
There are 121 features that are not used by the classifier
In [73]:
n_top = 4
to_analyse = list(feature_analysis['feature'].head(n_top).values)
df = pd.DataFrame(x_df.filter(items=to_analyse))
df['sick'] = y
df.head(4)
Out[73]:
offline_pct miss_snr_dn_6h cmts_utilization_dn_2d cmts_cer_up_1d sick
0 0.0 0.0 3.619974 -0.018750 0
1 0.0 0.0 6.365391 -0.070833 0
2 0.0 0.0 1.221328 0.000521 1
3 0.0 0.0 3.089870 -0.002188 1
In [77]:
compare_distrib(to_analyse[0],df)
In [78]:
compare_distrib(to_analyse[1],df)
In [79]:
compare_distrib(to_analyse[2],df)
In [80]:
compare_distrib(to_analyse[3],df)

On these four variables we have decided to bin the range of values that was taken by the variable and plotted side by side the distribution of the variable among the sick and healthy population. One of the most interesting point from this analysis is the fact that the most important feature is the offline_pct and it shows that sick CPEs have a much greater chance of being offline at least one hour during the last 24 hours than healthy CPEs.

That is also interesting as we also have in the feature the offline percentage over the last 24 hours by 6 hour windows. Hence this could indicate that we do not have the same response time between the problem happens and the call for all users.

But also it shows us that the fact that the CPE went offline is a great indicator of sickness and therefore that we may consider that a CPE is offline when it is experiencing problems (that could influence the way that we use predictions to proactively act on the CPEs)

Tuning the model

We have seen during the model exploration phase that Gradient Boosting shows promising results compared to other models. Therefore we will try to tune this model as much as possible. In order to perform such tuning we need to define a scoring function.

While the ROC-AUC (Area Under the ROC Curve) seems to be used fairly often to measure the performance of classifier we are here in a case that corresponds to anomaly detection and therefore the ROC curve is less powerful as it cannot well discriminate when the number of TN is much higher than TP. Therefore we will use this metric in order to score our classifiers.

Nevertheless as explained earlier we are really focusing on a particular range of precision as we have no business use for low precision and high recall. Therefore we have decided to use the partial Area under the Precision Recall Curve such that we get a single number summarizing how does the classifier performs on the leftmost part of the curve.

Rk: while we have been focusing so far on the Recall for 90% accuracy to discriminate models that will give us a better way to fine-tune a given model

Tuning Boosting Parameters

The general approach is to fix the learning rate and the number of estimators for tuning tree based parameters. Indeed there are a lot of parameters that can be modified therefore we need a strategy to avoid an exponentially growing number of possible combination.

In [97]:
learning_rates = np.linspace(0.05,0.3,5)
n_estimators = [int(x) for x in np.linspace(20,100,10)]
params = {'learning_rate': learning_rates, 'n_estimators':n_estimators}

estimator = lambda kw_args: make_pipeline(simple_preproc,
                                          sklearn.ensemble.GradientBoostingClassifier(**kw_args))

res = custom_GridSearchCV(x,y,dates,estimator,params,cv=5)
[############################################################] 100.0% (50/50) Trying different combinations

Tuning Tree Based parameter

Following the advices given here we will use the previously determined optimal Boosting parameters to grid search for the tree specific parameters in the following order:

  1. Tune max_depth (typically between 2 and 10) and num_samples_split (typically ~0.5-1% of total values)
  2. Tune min_samples_leaf
In [99]:
optimal_args = res['best']['best_args']
opt_learning_rate = optimal_args['learning_rate']
opt_n_est = optimal_args['n_estimators']

params = {'learning_rate': [opt_learning_rate], 'n_estimators': [opt_n_est], 
             'max_depth':range(5,16,2), 'min_samples_split':range(200,1001,200)}

res = custom_GridSearchCV(x,y,dates,estimator,params,cv=5)
[############################################################] 100.0% (30/30) Trying different combinations

Looking at the final model

In [101]:
optimal_args = res['best']['best_args']
print('Our Grid Search reveiled that the following are the optimal parameters')
print('\tBoosting Parameters:\n\t\tNumber of estimators: {}\n\t\tLearning Rate: {}'.format(optimal_args['n_estimators'],optimal_args['learning_rate']))
print('\tTree Based Parameters:\n\t\tMaximum depth: {}\n\t\tMinimun samples splits: {}'.format(optimal_args['max_depth'],optimal_args['min_samples_split']))
Our Grid Search reveiled that the following are the optimal parameters
	Boosting Parameters:
		Number of estimators: 64
		Learning Rate: 0.1125
	Tree Based Parameters:
		Maximum depth: 5
		Minimun samples splits: 200
In [136]:
# save the model to disk
import pickle,os

filename = DATA_FOLDER + '/final_model_untrained.sav'
if(not os.path.isfile(filename)):
    pickle.dump(estimator(res['best']['best_args']), open(filename, 'wb'))
    clf = estimator(res['best']['best_args'])
else:
    # load model
    clf = pickle.load(open(filename, 'rb'))

PR curve

In [103]:
single_precision_recall_curve(estimator(res['best']['best_args']), x, y, dates,'')
Out[103]:
array([ 0.26534215,  0.18457023,  0.10009026,  0.52415743])

ROC Curve

In [104]:
single_roc_curve(x,y,estimator,res['best']['best_args'],top_ratio = 0.2,distinct_date=True,dates=dates)
Out[104]:
array([  52.042537  ,  100.        ,   68.43856789])

Model Discussion

In this section we will put the different analysis that have been done on the model, they didn't necessarily took place once we had the final model but we judged that they were not adding a lot of information to a potential reader in the process that has been described above (they didn't influence the choices that we have made)

Model output check

We didn't have so far the possibility to check the output of our model because of the lack of resources that the HFC support team could dedicate to this project. We waited the very last time to perform a prediction on the day where we would meet the team in order to check whether they would see a problem with the predicted CPEs or not.

To perform such check we had to use a sample from the database that was computed on the same day (describing the health measurements relative to day_0 being the previous day) than the meeting but also we had to modify the way we build our datasets.

Building the test set: importing the full dataset

Beforehand we used to delete the days where there are no sick CPEs but indeed the 8 last days of our sample are always without any sick CPEs as we do not yet know what are the FTR (first time resolution) flags yet

In [105]:
s_day = '11_06'
DATA_PATH = DATA_FOLDER + '/sample_'+s_day+'.xlsx'
BACKUP_PATH = DATA_FOLDER+'/sample_'+s_day+'.pk'

full_extracted = import_sample(DATA_PATH,BACKUP_PATH,delete_only_healthy_days=False)
full_dates = full_extracted['day_0'].values

# we get the most recent day to build the test set
test_extracted = full_extracted[full_extracted['day_0'] == np.max(np.unique(full_dates))]
x_extracted_te, y_extracted_te = get_ml_data(test_extracted)
x_encoded_te = encode_categorical(x_extracted_te)
x_te_df = remove_features(x_encoded_te,verbose = True)
x_te = x_te_df.values
Retrieving from ./Data/sample_11_06_full.pk
The sample is composed of : 20715 vectors of dimension 291
	n_sick		=   5733
	n_healthy	=  14982
Deleting 54 features (18.305%).

Building the training set: deleting days where there are no sick CPEs

We do need to have the labels for the training set and therefore we use our usual technique to get the data ready for machine learning:

In [106]:
x_tr,y_tr,dates = usable_data('11_06',DATA_FOLDER)
Retrieving from ./Data/sample_11_06_sick_only.pk
The sample is composed of : 18636 vectors of dimension 291
	n_sick		=   5733
	n_healthy	=  12903
Deleting 54 features (18.305%).

Cutoff probabilities

What the Precision Recall curve shows us is also the probability at which we need to cutoff in order to consider a CPE healthy or sick depending on the levels of (Precision,Recall) we wish to obtain (on average).

In [108]:
single_precision_recall_curve(estimator(res['best']['best_args']), x_tr, y_tr, dates,'')
Out[108]:
array([ 0.32460973,  0.24243559,  0.12910803,  0.58073476])
In [109]:
# fit the model
clf = estimator(res['best']['best_args'])
clf.fit(x_tr,y_tr)
# predict
cutoff = 0.7556
sick_probas = clf.predict_proba(x_te)[:,1]
test_extracted['sick_proba'] = sick_probas
test_extracted[test_extracted['sick_proba'] >= cutoff]
Out[109]:
weekday day_0 mac cly_account_number saa_account_number cmts service_group hardware_model n_cpe_building offline_pct_6h ... cmts_utilization_dn_18h cmts_utilization_dn cmts_utilization_dn_1d cmts_utilization_dn_2d cmts_utilization_dn_3d cmts_utilization_dn_4d cmts_utilization_dn_5d seq_id milestone_name sick_proba
20196 6 2018-06-10 546751F40E59 6219737 6219737 ch-mbcLOC301 RW25 0 17 16.666667 ... -0.969167 12.788516 NaN NaN NaN NaN NaN 31653587 NaN 0.778949

1 rows × 292 columns

Because we didn't have any CPE that was satisfying the most discriminative cutoff, we decided to go for the second most discriminative in order to check some suspected CPEs with the HFC support team.

More test dates

Because we only had 1 CPE at hand, we decided to test on more dates as the team also has access to an history up to 10 days in the tools that help them to give us a diagnostic.

In [110]:
# we get all the dates that won't be in the training set
test_extracted = full_extracted[full_extracted['day_0'].isin(np.setdiff1d(full_dates,dates))]
x_extracted_te, y_extracted_te = get_ml_data(test_extracted)
x_encoded_te = encode_categorical(x_extracted_te)
x_te_df = remove_features(x_encoded_te,verbose = True)
x_te = x_te_df.values
Deleting 54 features (18.305%).
In [112]:
sick_probas = clf.predict_proba(x_te)[:,1]
test_extracted['sick_proba'] = sick_probas
test_extracted[test_extracted['sick_proba'] >= cutoff]
Out[112]:
weekday day_0 mac cly_account_number saa_account_number cmts service_group hardware_model n_cpe_building offline_pct_6h ... cmts_utilization_dn_18h cmts_utilization_dn cmts_utilization_dn_1d cmts_utilization_dn_2d cmts_utilization_dn_3d cmts_utilization_dn_4d cmts_utilization_dn_5d seq_id milestone_name sick_proba
13744 5 2018-05-19 905C44171E3B 5663146 5663146 mbcCHR301 RW85 0 4 0.000000 ... 2.464896 11.082812 0.338516 -0.625000 -0.362214 0.810781 -0.015156 21798468 NaN 0.801936
13803 5 2018-05-19 5467518B1C02 678993 678993 ch-mbcAMO301 RW07 0 43 NaN ... -3.720312 26.944036 7.533854 -3.300807 -1.294115 -3.196578 3.237333 21625733 NaN 0.780121
16790 0 2018-05-28 546751ACAC08 5623418 5623418 ch-mbcALT301 RW13 0 5 66.666667 ... NaN 34.181432 NaN NaN NaN NaN NaN 26078300 NaN 0.915068
16792 6 2018-05-27 5467518B1B0F 6536538 6536538 ch-mbcWNT301 RW07 0 9 100.000000 ... -3.132917 13.481536 1.152344 1.952344 -2.613542 0.980104 0.622396 25589372 NaN 0.780728
16876 0 2018-05-28 5467518A587B 6032633 6032633 mbcNAX302 629-648645-648 0 4 0.000000 ... -2.411771 25.353698 -3.030615 4.457990 4.102286 -3.523198 0.404109 26044465 NaN 0.841579
16883 6 2018-05-27 905C44194F50 946512 946512 ch-mbcFRR301 RW12 0 3 0.000000 ... -11.005729 18.122031 1.057604 4.580365 1.521927 0.555104 -1.148021 25746973 NaN 0.846900
16988 0 2018-05-28 AC220510637C 6070718 6070718 mbcBYB302 RW34 0 4 66.666667 ... -7.828542 17.751250 -4.874974 3.185651 -0.123385 3.215208 -1.484401 26354382 NaN 0.883945
19089 2 2018-06-06 905C44BF6396 6586897 6586897 mbcBIE301 RW113 0 3 16.666667 ... 0.280313 8.525547 -0.585964 0.120182 -0.764167 NaN NaN 29859091 NaN 0.787298
19269 3 2018-06-07 C427951777A7 6036968 6036968 ch-mbcMDR301 RW07 4 11 0.000000 ... -1.493125 8.181250 0.169740 -0.916823 -0.212839 -0.375443 -0.117637 30451867 NaN 0.765906
19742 6 2018-06-03 546751D4CBF9 6062076 6062076 ch-mbcSTW301 RW42 0 12 0.000000 ... -4.384167 21.648255 -0.972969 3.145391 0.432318 -0.142240 -0.133281 29211031 NaN 0.816550
20196 6 2018-06-10 546751F40E59 6219737 6219737 ch-mbcLOC301 RW25 0 17 16.666667 ... -0.969167 12.788516 NaN NaN NaN NaN NaN 31653587 NaN 0.778949
20254 5 2018-06-09 AC22059AC221 6623100 6623100 ch-mbcPFB301 RW26 0 5 100.000000 ... -6.940208 30.684505 -0.174583 -0.854453 0.125703 1.644141 -0.838021 31393578 NaN 0.844418
20405 5 2018-06-09 AC22056EE1A2 6628390 6628390 ch-mbcBRN303 RW175 0 3 100.000000 ... -2.765104 18.800104 2.956615 -0.393724 1.517344 -0.995833 0.269245 31376663 NaN 0.826967
20510 5 2018-06-09 38437D28041F 1968264 1968264 ch-mbcLMB301 RW12 0 12 100.000000 ... -6.453437 20.661328 1.387448 -2.036510 4.483229 1.680677 2.524557 31007246 NaN 0.849345
20563 5 2018-06-09 546751AF0E13 5714186 5714186 mbcBYB302 RW109 0 4 0.000000 ... -13.798125 15.173099 2.791016 -1.676589 0.030911 2.183177 -0.218229 31114886 NaN 0.780202
20619 4 2018-06-08 905C4440D33B 5274209 5274209 ch-mbcDEL301 RW62 0 2 100.000000 ... -4.351771 27.658359 -0.257474 2.089062 -1.493880 0.812187 -2.855469 30779602 NaN 0.799761

16 rows × 292 columns

Observation Results

We decided to check 4 different devices to see if they indeed experienced some problem:

  • 546751F40E59 ($p = 0.774174$): A HFC interface keeps resetting (a lot), during this period the customer has no connection. Probably a hardware modem issue. True Positive
  • AC22059AC221 ($p = 0.892083$): Went offline for an extended amount of time. The only problem seems to be a lot of missing values, there are a lot of timeouts and it might be a data collection issue. False Positive
  • C427951777A7 ($p = 0.821313$): Lots of packets are dropped on one of the channel, this can result for the customer in performance issues (service degradations). It is most probably a cabling problem that he can solve himself if we send him a repalcement cable. True Positive
  • 905C44BF6396 ($p = 0.783860$): We can see that levels were down for an extended period of time and went back up, which should indicate a technician intervention. We couldn't find it but given the similarities we can conclude that this is a True Positive

HFC support expressed a lot of satisfaction with respect to these results as out of the 4 predictions we analyzed only one was a false positive and it was still displaying irregularities.

Time gap influence

We got curious regarding what was the influence of the number of days that we waited between the training and prediction on the performance of the classifier.

Because we do not have that many days of data we will limit this analysis but the main idea is to see whether the model needs to be retrained every day or every 3 days, every week etc.

In [113]:
dates_to_consider = get_longest_date_seq(dates)
The longuest sequence of dates has length 41:
	S: 2018-04-08 00:00:00
	E: 2018-05-18 00:00:00

With 41 days to work with we now need to decide on a strategy to use such dates. We would like to see the influence of the gap period $\in\{0,...7\}$ (= difference between last day of training and first day of testing).

We want to keep the training set size a constant such that it doesn't influence the performance of the model

Also we would like to repeat the operation multiple times in order to have a representative metric in the end. What we will do is try to use 25% of the dataset to test for the largest gap length. Here we want to have 7 days at most of gap:

  • 8 different testing rounds (on 8 testing days)
  • a training set of size 26 days for each round and for all the gap length

Therefore when gap_length = 7, what we do is that the first training round will take:

  • training set: Day_1 to Day_26
  • testing set: Day_34 ... until the 8th round:
  • training set: Day_8 to Day_33
  • testing set: Day_41

and when gap_length = 1, the first round is:

  • training set: Day_1 to Day_26
  • testing set: Day_28 ... until the 8th round:
  • training set: Day_8 to Day_33
  • testing set: Day_35

Then in order to reduce the number of trainings that take time we see that for each gap_length:

  • At each round the training set is identical and only the testing set is different.
  • Therefore we can train a model as many time as we have different gap_length and evaluate on each training set.
  • we will concatenate for each gap length true_y and score_y that are computed at each round to have a good estimate of the performance.
In [114]:
plot_gap_performance(x_tr,y_tr,dates,clf,list(range(0,8)))
[############################################################] 100.0% (8/8) 
In [115]:
plot_gap_performance(x_tr,y_tr,dates,clf,[0,3,7,12])
[############################################################] 100.0% (7/7) 

We can see that there is no significant difference (apart from the variance due to the dataset, we see that we cannot observer a trend where the more days we have between training and predicting the more the performance decrease) between the performance of the models whatever the time gap between training and testing which is reassuring as it means that we have already taken out a good part of the seasonality of the data. But it also ensures that it is okay that the training set is in real life 8 days younger than the testing set (because of the FTR flag)

Same weekday influence

We were fairly curious regarding how our initial cross validation strategy was not estimating correctly the performance of our classifier. Our main guess was that the vectors still contained some sort of seasonality that we couldn't get rid of during the aggregation of the data.

Because we didn't have time to investigate deeply the problem we simply tried one thing which was to see if by training the model on the same week day than the day of testing we would get an overfitted model that would have strangely high performance

In [116]:
date_string = '11_06'
data_location = DATA_FOLDER
data_path = data_location + '/sample_'+date_string+'.xlsx'
backup_path = data_location+'/sample_'+date_string+'.pk'


extracted = import_sample(data_path,backup_path)
dates = extracted['day_0'].values
x_extracted, y_extracted = get_ml_data(extracted)
y = convert_to_binary_labels(y_extracted)
x_encoded = encode_categorical(x_extracted)
x_df = remove_features(x_encoded,verbose = True)
Retrieving from ./Data/sample_11_06_sick_only.pk
The sample is composed of : 18636 vectors of dimension 291
	n_sick		=   5733
	n_healthy	=  12903
Deleting 54 features (18.305%).
In [135]:
weekday_influence(x_df,y,dates,clf)
[############################################################] 100.0% (7/7) Creating subplots for each week day

What we observe is that the model is much less stable on weekends but it seems like during the rest of the week the performance of the classifier doesn't increase drastically and therefore we do not have a clear sign of overfitting on the weekdays.